## root / hw5 / hw5.txt @ 44

HW5
Chris Mar (cmar)
Kevin Woo (kwoo)

We worked together on all parts.

a.

sim_CPI                        0.5901   # cycles per instruction
dl1.miss_rate                  0.1019   # miss rate (i.e., misses/ref)
dcache_power_cc3         8665100.5508   # total power usage of dcache_cc3
dcache2_power_cc3        7120581.8623   # total power usage of dcache2_cc3
total_power_cycle_cc3  256132764.8597   # total power per cycle_cc3

b.

sim_CPI                        0.4959   # cycles per instruction
dl1.miss_rate                  0.0339   # miss rate (i.e., misses/ref)
dcache_power_cc3         8167667.0943   # total power usage of dcache_cc3
dcache2_power_cc3        5450757.1383   # total power usage of dcache2_cc3
total_power_cycle_cc3  241974837.9655   # total power per cycle_cc3

By storing the B matrix column-wise instead of row-wise, we take a penalty
when writing to the matrix but more than make up for it when computing the
result. When computing the C matrix we access B many times and traverse it
by column; storing it column-wise keeps the active columns in cache, so we
take fewer cache misses, which in turn lowers the total power used. Overall
we see a 6.8-percentage-point decrease in the miss rate, a 6% decrease in
dcache power, and a 24% decrease in dcache2 power. Total system power
decreased by 5.5%, and CPI dropped from 0.5901 to 0.4959.

c.

sim_CPI                        0.6530   # cycles per instruction
dl1.miss_rate                  0.1258   # miss rate (i.e., misses/ref)
dcache_power_cc3         7243250.5357   # total power usage of dcache_cc3
dcache2_power_cc3        6618676.5259   # total power usage of dcache2_cc3
total_power_cycle_cc3  218454713.4478   # total power per cycle_cc3

We tried loop unrolling factors of 2, 4, 8, 16, 32, and 64. The results
shown above are from unrolling the inner loop of the matrix multiplication
32 times.
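For reference, the unrolled inner loop has roughly the following shape. This is a sketch, not the exact submitted code: the array names, the 128x128 matrix size, and the use of a local accumulator are assumptions, and a 4-way unroll is shown for brevity where the submission used a factor of 32.

```c
#define N 128  /* assumed matrix dimension (tile sizes of 2-128 suggest 128x128) */

double A[N][N], B[N][N], C[N][N];

/* Matrix multiply with the innermost k loop unrolled 4 ways.  N is
   divisible by the unroll factor, so no cleanup loop is needed. */
void matmul_unrolled(void)
{
    int i, j, k;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            double s = 0.0;
            for (k = 0; k < N; k += 4) {
                s += A[i][k]     * B[k][j];
                s += A[i][k + 1] * B[k + 1][j];
                s += A[i][k + 2] * B[k + 2][j];
                s += A[i][k + 3] * B[k + 3][j];
            }
            C[i][j] = s;
        }
    }
}
```

Unrolling removes most of the loop-bookkeeping work (increments, compares, branches) from the inner loop; in the simulation this reduced the total number of cache accesses, which is why the miss rate rose even though the miss count was unchanged.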
Unrolling 32 times was the optimal point: power decreased at every unroll
factor up to 32 and rose again when we unrolled by 64. CPI and the dl1
cache miss rate both increased; the miss rate rose because the number of
dl1 misses stayed the same while the total number of cache accesses
decreased. L1 dcache power decreased by 16.4% and L2 dcache power by 7%.
Total system power dropped by 14.7%, a drastic improvement in power
consumption.

d.

sim_CPI                        0.5610   # cycles per instruction
dl1.miss_rate                  0.0852   # miss rate (i.e., misses/ref)
dcache_power_cc3        15912831.7737   # total power usage of dcache_cc3
dcache2_power_cc3       13651577.5999   # total power usage of dcache2_cc3
total_power_cycle_cc3  500329672.6200   # total power per cycle_cc3

We compare this against the original code with the optimization of storing
intermediate results in s removed; instead we accumulate directly into the
C[i][j] array, so the comparison between the two versions is fairer. These
are the baseline numbers for the matrix multiply with the s variable
removed:

sim_CPI                        0.5145   # cycles per instruction
dl1.miss_rate                  0.0923   # miss rate (i.e., misses/ref)
dcache_power_cc3        12035597.9925   # total power usage of dcache_cc3
dcache2_power_cc3       10016792.3405   # total power usage of dcache2_cc3
total_power_cycle_cc3  386723845.9954   # total power per cycle_cc3

The tiled results were obtained with a tile size of 128, which effectively
means no tiling; that tile size gave the lowest dcache and dcache2 numbers.
We tested tile sizes from 2 to 128 in powers of 2. Relative to the baseline
we see a 9% increase in CPI and an 8% decrease in the cache miss rate.
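The tiled version we swept has roughly the following shape. Again a sketch under assumed names: N is the matrix dimension, T the tile size, and accumulation goes directly into C[i][j] as in the s-removed baseline.

```c
#define N 128  /* assumed matrix dimension */
#define T 8    /* tile size; we swept 2 through 128 in powers of two */

double A[N][N], B[N][N], C[N][N];

/* Blocked (tiled) matrix multiply.  Each T x T block of A and B is reused
   while it is still resident in the cache.  C must be zeroed beforehand;
   file-scope arrays in C start out zeroed. */
void matmul_tiled(void)
{
    int ii, jj, kk, i, j, k;
    for (ii = 0; ii < N; ii += T)
        for (jj = 0; jj < N; jj += T)
            for (kk = 0; kk < N; kk += T)
                for (i = ii; i < ii + T; i++)
                    for (j = jj; j < jj + T; j++)
                        for (k = kk; k < kk + T; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

With T = N the three tile loops collapse to a single iteration each and the code degenerates to the untiled multiply, which is why a tile size of 128 is effectively no tiling.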
However, these benefits come at the expense of power: a 32% increase in L1
dcache power, a 36% increase in L2 dcache power, and a 29% increase in
total core power. These numbers are much worse than those of the optimized
version of the program.

e.

These are the comparisons of the data from the original simulation and
parts c. and d., normalized to the original data:

Total power:
 - Original: 1.00
 - Unrolled: 0.85
 - Tiled:    1.95

Memory power:
 - Original: 1.00
 - Unrolled: 0.88
 - Tiled:    1.87

Our results for total core power match the SimplePower data given in the
lecture slides: unrolling reduces total power, while tiling increases it.
However, our results do not match the SimplePower memory-power results in
the tiling case. We find that unrolling reduces memory power and tiling
increases it, whereas the SimplePower results indicate that tiling should
provide the greatest reduction in memory power usage. The difference
between our data and the SimplePower data could be due to differences in
cache configuration, the simulated workload, the type of core modeled, the
memory (RAM) model, or the power models used.