HW5
Chris Mar (cmar)
Kevin Woo (kwoo)
We worked together on all parts.

a.
sim_CPI                      0.5901 # cycles per instruction
dl1.miss_rate                0.1019 # miss rate (i.e., misses/ref)
dcache_power_cc3       8665100.5508 # total power usage of dcache_cc3
dcache2_power_cc3      7120581.8623 # total power usage of dcache2_cc3
total_power_cycle_cc3  256132764.8597 # total power per cycle_cc3

b.
sim_CPI                      0.4959 # cycles per instruction
dl1.miss_rate                0.0339 # miss rate (i.e., misses/ref)
dcache_power_cc3       8167667.0943 # total power usage of dcache_cc3
dcache2_power_cc3      5450757.1383 # total power usage of dcache2_cc3
total_power_cycle_cc3  241974837.9655 # total power per cycle_cc3

By storing the B matrix column-wise instead of row-wise, we take a penalty
when writing to the matrix but more than make up for it when we read the
result. This is because when computing the C matrix we access B many times
and traverse it by column. With column-wise storage the columns stay
resident in cache, so we take fewer cache misses, which in turn lowers the
total power used. Overall, the miss rate falls from 10.2% to 3.4%, dcache
power drops by about 6%, and dcache2 power drops by about 23%. The total
power of the system decreases by 5.5%, and CPI drops from 0.5901 to 0.4959.
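A minimal sketch of the idea, not the submitted code: the matrix dimension N, the function name, and the use of a transposed copy Bt are our assumptions for illustration. Copying B column-wise is the write penalty; afterwards the inner loop reads row j of Bt sequentially, which is what keeps B's columns in cache.

```c
#define N 128  /* assumed matrix dimension; the writeup does not state it */

/* C = A * B, with B first copied column-wise into Bt (Bt[j][k] == B[k][j]).
 * The transpose is the write penalty mentioned above; after it, every
 * inner-loop read walks row j of Bt sequentially, so the column of B being
 * consumed stays in cache and misses drop. */
void matmul_b_colwise(double A[N][N], double B[N][N], double C[N][N])
{
    static double Bt[N][N];            /* column-wise copy of B */

    for (int k = 0; k < N; k++)        /* write penalty: transpose B */
        for (int j = 0; j < N; j++)
            Bt[j][k] = B[k][j];

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k++)
                s += A[i][k] * Bt[j][k];   /* sequential reads of Bt */
            C[i][j] = s;
        }
}
```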

c.
sim_CPI                      0.6530 # cycles per instruction
dl1.miss_rate                0.1258 # miss rate (i.e., misses/ref)
dcache_power_cc3       7243250.5357 # total power usage of dcache_cc3
dcache2_power_cc3      6618676.5259 # total power usage of dcache2_cc3
total_power_cycle_cc3  218454713.4478 # total power per cycle_cc3

We tried loop unrolling at 2, 4, 8, 16, 32, and 64 times. The results shown
above are from unrolling 32 times within the inner loop of the matrix
multiplication. Unrolling 32 times was the optimal point because power
decreased at every factor before it and increased when we tried unrolling
by 64. CPI and the dl1 cache miss rate both rose relative to the original
run (0.5901 to 0.6530 and 10.2% to 12.6%) because the number of dl1 cache
misses remained roughly the same while the total number of instructions
and cache accesses decreased. The L1 cache power decreased by 16.4% and
the L2 cache power decreased by 7%. The total power of the system is
reduced by 14.7%, a drastic improvement in power consumption.
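The technique can be sketched as below; this is unrolled by 4 for readability rather than the 32 used for the results above, and N and the function name are our illustrative assumptions. Each iteration now does four multiply-accumulates per loop-counter increment and branch, so total instruction count (and with it, cache references) goes down.

```c
#define N 128  /* assumed dimension; must be divisible by the unroll factor */

/* Inner loop of the multiply unrolled by 4 (the run above used 32; the
 * idea is identical).  Fewer counter increments and branch instructions
 * mean fewer total instructions for the same arithmetic work. */
void matmul_unrolled(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double s = 0.0;
            for (int k = 0; k < N; k += 4) {
                s += A[i][k]     * B[k][j];
                s += A[i][k + 1] * B[k + 1][j];
                s += A[i][k + 2] * B[k + 2][j];
                s += A[i][k + 3] * B[k + 3][j];
            }
            C[i][j] = s;
        }
}
```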

d.
sim_CPI                      0.5610 # cycles per instruction
dl1.miss_rate                0.0852 # miss rate (i.e., misses/ref)
dcache_power_cc3       15912831.7737 # total power usage of dcache_cc3
dcache2_power_cc3      13651577.5999 # total power usage of dcache2_cc3
total_power_cycle_cc3  500329672.6200 # total power per cycle_cc3

We compare this to the original but remove the optimization of storing the
intermediate results in s. Instead, we accumulate directly into the C[i][j]
array so that the comparison between the two versions is fairer. These are
the baseline numbers for the matrix multiply with the s variable removed:

sim_CPI                      0.5145 # cycles per instruction
dl1.miss_rate                0.0923 # miss rate (i.e., misses/ref)
dcache_power_cc3       12035597.9925 # total power usage of dcache_cc3
dcache2_power_cc3      10016792.3405 # total power usage of dcache2_cc3
total_power_cycle_cc3  386723845.9954 # total power per cycle_cc3

These results were calculated with a tile size of 128, meaning that we
essentially had no tiling; this tile size gave the lowest dcache and
dcache2 numbers. We tested tile sizes from 2 to 128 in powers of 2.
Relative to this baseline, CPI rises by 9% and the cache miss rate falls
by 8%. The lower miss rate comes at the expense of power, resulting in a
32% increase in dcache (level 1) power, a 36% increase in dcache (level 2)
power, and a 29% increase in total core power. These numbers are much
worse than the optimized version of the program.
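For reference, a blocked multiply that accumulates directly into C[i][j] (matching the baseline variant described above) can be sketched as follows. N, TILE, and the function name are our assumptions, not the submitted code; TILE must divide N for this simple form.

```c
#define N    128
#define TILE 32   /* illustrative tile size; the submission swept 2..128 */

/* Blocked (tiled) matrix multiply.  C is zeroed first because partial
 * sums for each C[i][j] accumulate across the kk tile loop rather than
 * in a scalar temporary. */
void matmul_tiled(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;

    for (int ii = 0; ii < N; ii += TILE)
        for (int kk = 0; kk < N; kk += TILE)
            for (int jj = 0; jj < N; jj += TILE)
                /* multiply one TILE x TILE block of A into C */
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++) {
                        double a = A[i][k];
                        for (int j = jj; j < jj + TILE; j++)
                            C[i][j] += a * B[k][j];
                    }
}
```

With TILE == N the tile loops collapse to a single pass, which is why a tile size of 128 is effectively no tiling.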

e.
These are the comparisons of the data from the original simulations and
parts c. and d., normalized to the original data:
Total Power:
 - Original: 1.00
 - Unrolled: 0.85
 - Tiled:    1.95
Memory Power:
 - Original: 1.00
 - Unrolled: 0.88
 - Tiled:    1.87
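The normalized figures can be reproduced from the raw stats above. One assumption here: we take "memory power" to mean dcache_power_cc3 + dcache2_power_cc3, which is consistent with the listed ratios.

```c
/* Raw totals from parts a (original), c (unrolled), and d (tiled) above.
 * "Memory power" is assumed to be dcache_power_cc3 + dcache2_power_cc3. */
static const double TOT_ORIG   = 256132764.8597;
static const double TOT_UNROLL = 218454713.4478;
static const double TOT_TILE   = 500329672.6200;
static const double MEM_ORIG   = 8665100.5508  + 7120581.8623;
static const double MEM_UNROLL = 7243250.5357  + 6618676.5259;
static const double MEM_TILE   = 15912831.7737 + 13651577.5999;

/* Normalize a statistic against the original run. */
static double normalized(double value, double original)
{
    return value / original;
}
```

For example, normalized(TOT_UNROLL, TOT_ORIG) is about 0.85 and normalized(MEM_TILE, MEM_ORIG) is about 1.87, matching the table.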

Our results for total core power match the SimplePower data given in the
lecture slides, where unrolling reduces total power but tiling increases
it. However, our results do not match SimplePower for memory power
consumption in the tiling case: we show that unrolling reduces memory
power but tiling increases it, whereas the SimplePower results indicate
that tiling should provide the greatest reduction in memory power usage.
This difference between our data and the SimplePower data could be due to
differences in cache configuration, the simulated workload, the type of
core modeled, the memory (RAM) model, or the power models used.