
HW5
Chris Mar (cmar)
Kevin Woo (kwoo)
We worked together on all parts.

a.
sim_CPI                  0.5901 # cycles per instruction
dl1.miss_rate            0.1019 # miss rate (i.e., misses/ref)
dcache_power_cc3         8665100.5508 # total power usage of dcache_cc3
dcache2_power_cc3        7120581.8623 # total power usage of dcache2_cc3
total_power_cycle_cc3    256132764.8597 # total power per cycle_cc3

b.
sim_CPI                  0.4959 # cycles per instruction
dl1.miss_rate            0.0339 # miss rate (i.e., misses/ref)
dcache_power_cc3         8167667.0943 # total power usage of dcache_cc3
dcache2_power_cc3        5450757.1383 # total power usage of dcache2_cc3
total_power_cycle_cc3    241974837.9655 # total power per cycle_cc3

By storing the B matrix column-wise instead of row-wise, we take a penalty
when writing to the matrix but more than make up for it when we read the
result. When computing the C matrix we access B many times and traverse it
by column; with a column-wise layout, each column stays resident in the
cache, giving us fewer cache misses. Fewer misses in turn lower the total
power used. Overall, the miss rate drops by almost 7 percentage points
(10.2% to 3.4%), dcache power falls by about 6%, dcache2 power by about
23%, and total system power by 5.5%. CPI also improves, from 0.5901 to
0.4959.
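
The layout change can be sketched as follows (illustrative code, not our
graded source; the matmul_bT name and the 4x4 dimension are assumptions).
Storing B transposed, so that bT[j][k] holds B[k][j], lets the inner
product walk both operands with unit stride:

```c
#include <stddef.h>

#define N 4  /* illustrative dimension */

/* Multiply with B stored column-wise, i.e. as its transpose:
   bT[j][k] == B[k][j]. The inner loop then reads a[i][.] and
   bT[j][.] contiguously, so B's column stays cache-resident. */
void matmul_bT(const double a[N][N], const double bT[N][N],
               double c[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double s = 0.0;
            for (size_t k = 0; k < N; k++)
                s += a[i][k] * bT[j][k]; /* both accesses unit-stride */
            c[i][j] = s;
        }
}
```

The one-time cost is that filling bT writes B's elements in a strided
(transposed) order, which is the write penalty mentioned above.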

c.
sim_CPI                  0.6530 # cycles per instruction
dl1.miss_rate            0.1258 # miss rate (i.e., misses/ref)
dcache_power_cc3         7243250.5357 # total power usage of dcache_cc3
dcache2_power_cc3        6618676.5259 # total power usage of dcache2_cc3
total_power_cycle_cc3    218454713.4478 # total power per cycle_cc3

We tried loop unrolling by factors of 2, 4, 8, 16, 32, and 64. The results
shown above are from unrolling the inner loop of the matrix multiplication
32 times. Unrolling by 32 was the optimal point: total power decreased at
every smaller factor and increased again when we unrolled by 64. Relative
to part a., CPI and the dl1 miss rate both increase; the miss rate rises
because the number of dl1 misses remained about the same while the total
number of cache accesses decreased. The L1 cache power decreased by 16.4%
and the L2 cache power by 7%. The total power of the system is reduced by
14.7%, a substantial improvement in power consumption.
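
A sketch of the inner-loop unrolling (illustrative, not our graded source;
a factor of 4 is shown for brevity where the run above used 32, and the
names and 8x8 dimension are assumptions):

```c
#include <stddef.h>

#define N 8  /* illustrative dimension, divisible by the unroll factor */

/* Inner loop unrolled by 4: fewer increment/compare/branch
   instructions are executed per multiply-accumulate. */
void matmul_unrolled(const double a[N][N], const double b[N][N],
                     double c[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double s = 0.0;
            for (size_t k = 0; k < N; k += 4) {
                s += a[i][k]     * b[k][j];
                s += a[i][k + 1] * b[k + 1][j];
                s += a[i][k + 2] * b[k + 2][j];
                s += a[i][k + 3] * b[k + 3][j];
            }
            c[i][j] = s;
        }
}
```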

d.
sim_CPI                  0.5610 # cycles per instruction
dl1.miss_rate            0.0852 # miss rate (i.e., misses/ref)
dcache_power_cc3         15912831.7737 # total power usage of dcache_cc3
dcache2_power_cc3        13651577.5999 # total power usage of dcache2_cc3
total_power_cycle_cc3    500329672.6200 # total power per cycle_cc3

We compare this to the original code, but with the optimization of storing
intermediate results in the scalar s removed. Instead, we accumulate
directly into the C[i][j] array, so that the comparison between the two
versions is fairer. These are the baseline numbers for the matrix multiply
with the s variable removed:

sim_CPI                  0.5145 # cycles per instruction
dl1.miss_rate            0.0923 # miss rate (i.e., misses/ref)
dcache_power_cc3         12035597.9925 # total power usage of dcache_cc3
dcache2_power_cc3        10016792.3405 # total power usage of dcache2_cc3
total_power_cycle_cc3    386723845.9954 # total power per cycle_cc3
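
The difference between the two baselines is only where the running sum
lives; a minimal sketch (illustrative names and an 8x8 dimension assumed,
not our graded source):

```c
#include <stddef.h>

#define N 8  /* illustrative dimension */

/* Original: accumulate in a scalar, which the compiler can keep in a
   register, so the k-loop touches the C array only once per (i, j). */
void matmul_scalar_acc(const double a[N][N], const double b[N][N],
                       double c[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            double s = 0.0;
            for (size_t k = 0; k < N; k++)
                s += a[i][k] * b[k][j];
            c[i][j] = s;
        }
}

/* Part d baseline: accumulate directly into c[i][j], making every
   addition a memory reference to the C array. */
void matmul_array_acc(const double a[N][N], const double b[N][N],
                      double c[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            c[i][j] = 0.0;
            for (size_t k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}
```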

These results were calculated with a tile size of 128, which means we
essentially had no tiling; this tile size gave the lowest dcache and
dcache2 numbers. We tested tile sizes from 2 to 128 in powers of 2. We see
a 9% increase in CPI and an 8% decrease in the cache miss rate. The lower
miss rate comes at the expense of power: dcache (L1) power rises by 32%,
dcache2 (L2) power by 36%, and total core power by 29%. These numbers are
much worse than the optimized version of the program.
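
The tiling transformation itself can be sketched like this (illustrative,
not our graded source; the names, 8x8 dimension, and 4x4 tile are
assumptions, where the experiment swept tile sizes 2 through 128). The j
and k loops are blocked so that a T x T block of B is reused while it is
still cache-resident:

```c
#include <stddef.h>

#define N 8  /* illustrative dimension */
#define T 4  /* illustrative tile size; must divide N */

/* Tiled multiply, accumulating directly into c[i][j] as in part d. */
void matmul_tiled(const double a[N][N], const double b[N][N],
                  double c[N][N]) {
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            c[i][j] = 0.0;
    for (size_t jj = 0; jj < N; jj += T)      /* block of B/C columns */
        for (size_t kk = 0; kk < N; kk += T)  /* block of B rows */
            for (size_t i = 0; i < N; i++)
                for (size_t j = jj; j < jj + T; j++)
                    for (size_t k = kk; k < kk + T; k++)
                        c[i][j] += a[i][k] * b[k][j];
}
```

With T == N the blocked loops collapse to a single iteration each, which
is why a tile size of 128 behaves like the untiled code.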

e.
These are the comparisons of the data from the original simulations and
parts c. and d., normalized to the original data:
Total Power:
- Original: 1.00
- Unrolled: 0.85
- Tiled: 1.95
Memory Power:
- Original: 1.00
- Unrolled: 0.88
- Tiled: 1.87

Our results for total core power match the SimplePower data given in the
lecture slides, where unrolling reduces total power but tiling increases
it. However, our results do not match the SimplePower memory-power results
in the tiling case: we find that unrolling reduces memory power but tiling
increases it, whereas the SimplePower results indicate that tiling should
provide the greatest reduction in memory power usage. This difference
between our data and the SimplePower data could be due to differences in
cache configuration, the simulated workload, the type of core modeled, the
memory (RAM) model, or the power models used.