![Page 1: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/1.jpg)
UNIVERSITAT POLITÈCNICA DE CATALUNYA, Departament d’Arquitectura de Computadors
Northeastern University
Heterogeneous Clustered VLIW Microarchitectures
Alex Aletà, Josep M. Codina, Antonio González and David Kaeli
CGO’07, San Jose, California - March 2007
![Page 2: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/2.jpg)
Clustered Microarchitectures
• Challenges in processor design
  - Wire delays
  - Power consumption
• Clustering: divide the system into semi-independent units
  - Each unit ⇒ cluster
  - Fast intra-cluster interconnects
  - Slow inter-cluster interconnects
• Common trend in commercial VLIW processors
  - DSP/embedded domain
![Page 3: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/3.jpg)
Clustered VLIW Architecture
[Figure: clustered VLIW processor. An I-cache feeds clusters 1 to N; each cluster has a register file and INT, FP and MEM functional units; the clusters are connected by register buses and backed by the data cache and main memory.]
![Page 4: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/4.jpg)
Motivation
• Not all instructions have the same impact on execution time
• Divide resources
  - Performance-oriented clusters
    - Higher voltages
    - Faster
    - Place critical instructions
  - Power-oriented clusters
    - Lower voltages
    - Consume less power
    - Place non-critical instructions
[Figure: example data dependence graph with nodes A to N]
![Page 5: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/5.jpg)
Motivation
Homogeneous

| | C0 | C1 | C2 | ICN |
|---|---|---|---|---|
| Cycle time | 1 | 1 | 1 | 1 |

Scheduling

| Cycle | C0 | C1 | C2 | Bus |
|---|---|---|---|---|
| 0 | A | | | |
| 1 | B | | | Com A |
| 2 | C | I | L | |
| 3 | D | J | M | |
| 4 | E | K | N | |
| 5 | F | | | |
| 6 | G | | | |
| 7 | H | | | |

[Figure: data dependence graph with nodes A to N]
![Page 6: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/6.jpg)
Motivation
Heterogeneous

| | C0 | C1 | C2 | ICN |
|---|---|---|---|---|
| Cycle time | 1 | 2 | 2 | 1 |

Scheduling

| Cycle | C0 | C1 | C2 | Bus |
|---|---|---|---|---|
| 0 | A | | | |
| 1 | B | | | Com A |
| 2 | C | I | L | |
| 3 | D | | | |
| 4 | E | J | M | |
| 5 | F | | | |
| 6 | G | K | N | |
| 7 | H | | | |

Cycle numbers count C0 cycles. With a cycle time of 2, C1 and C2 issue every second C0 cycle, and the schedule still finishes at cycle 7.

[Figure: data dependence graph with nodes A to N]
![Page 7: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/7.jpg)
Talk Outline
• Heterogeneous Clustered Architecture
• Proposed Compiler Techniques
• Experimental Evaluation
• Conclusions
![Page 8: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/8.jpg)
Heterogeneous Architecture
• Configured similar to a multiple clock domain design
  - Domain boundaries:
    - Each cluster
    - Inter-connection network
    - Memory hierarchy
• Each domain can use a different voltage / frequency
  - Performance oriented: higher voltage/frequency
  - Power oriented: lower voltage/frequency
• Communication between domains: synchronization queues
![Page 9: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/9.jpg)
Heterogeneous Architecture
[Figure: clustered VLIW processor with an I-cache, clusters 1 to N (each with a register file and INT, FP and MEM units), register buses, data cache and main memory, plus synchronization queues.]
![Page 10: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/10.jpg)
Distributed Control Path
• Fetch and decode units distributed among clusters
• Instruction lay-out
  - Grouped by cluster
[Figure: instruction memory layout.
 Centralized control path (single PC): I1C1 I1C2 I1C3 I1C4, I2C1 I2C2 I2C3 I2C4, I3C1 I3C2 I3C3 I3C4, ...
 Distributed control path (PC1 to PC4): I1C1 I2C1 I3C1 ...; I1C2 I2C2 I3C2 ...; I1C3 I2C3 I3C3 ...; I1C4 I2C4 I3C4 ...]
[Zhong et al. PACT’05]
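The difference between the two instruction layouts can be illustrated with a small Python sketch. The function names and the `I{n}C{c}` / `PC{c}` string labels are illustrative assumptions that mirror the figure; the sketch only reorders labels and does not model the fetch hardware.

```python
def centralized_layout(n_instr, n_clusters):
    """Single stream, single PC: the operations of one VLIW instruction
    (one per cluster) are stored next to each other."""
    return [f"I{i}C{c}"
            for i in range(1, n_instr + 1)
            for c in range(1, n_clusters + 1)]


def distributed_layout(n_instr, n_clusters):
    """One stream per cluster, each with its own PC: operations are
    grouped by the cluster that executes them."""
    return {f"PC{c}": [f"I{i}C{c}" for i in range(1, n_instr + 1)]
            for c in range(1, n_clusters + 1)}


if __name__ == "__main__":
    print(centralized_layout(3, 4))
    print(distributed_layout(3, 4))
```

With the grouped layout each cluster can fetch its own stream independently, which is what allows fetch and decode to be distributed.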
![Page 11: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/11.jpg)
Talk Outline
• Heterogeneous Clustered Architecture
• Proposed Compiler Techniques
• Experimental Evaluation
• Conclusions
![Page 12: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/12.jpg)
Statically Scheduled Processors
• Performance relies on the compiler
  - Instruction scheduling
  - Register allocation
  - Clustered microarchitectures
    - Cluster assignment
    - Communications
• Multimedia and numeric code
  - Majority of the execution in loop bodies
  - Modulo scheduling
![Page 13: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/13.jpg)
Modulo Scheduling
• Effective technique for scheduling loops
  - Overlaps loop iterations
  - Increases parallelism
[Figure: iteration space over time (cycles 1 to 6). Successive iterations (1st to 4th) of a loop body A, B, C overlap in time; the kernel is the pattern repeated every II cycles.]
• II ≥ MII (Minimum Initiation Interval)
  - Constrained by resources and recurrences
  - MII = max {resMII, recMII}
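A minimal Python sketch of the MII bound follows. The slide only states MII = max {resMII, recMII}; the definitions of the two terms used here are the usual textbook ones for modulo scheduling (operations per resource type divided by available units, and latency divided by iteration distance for each recurrence), and the input format is a hypothetical one chosen for illustration.

```python
from math import ceil


def res_mii(op_counts, unit_counts):
    """Resource bound: for each resource type, operations per iteration
    divided by the number of units of that type, rounded up."""
    return max(ceil(op_counts[kind] / unit_counts[kind]) for kind in op_counts)


def rec_mii(recurrences):
    """Recurrence bound: each dependence cycle is given as a
    (total_latency, total_iteration_distance) pair."""
    return max((ceil(lat / dist) for lat, dist in recurrences), default=1)


def mii(op_counts, unit_counts, recurrences):
    """MII = max {resMII, recMII}."""
    return max(res_mii(op_counts, unit_counts), rec_mii(recurrences))


if __name__ == "__main__":
    # Hypothetical loop: 5 INT, 2 FP and 3 MEM operations per iteration.
    ops = {"int": 5, "fp": 2, "mem": 3}
    units = {"int": 2, "fp": 1, "mem": 1}
    print(mii(ops, units, [(6, 1)]))  # one recurrence: latency 6, distance 1
```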
![Page 14: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/14.jpg)
MS for Heterogeneous
[Figure: iteration space over time (cycles 1 to 6) for a loop body with operations A, B, C, D, E; the 1st, 2nd and 3rd iterations overlap, and the kernel is marked.]
Each cluster has its own II
Constant Iteration Time (IT)
MIT = max {resMIT, recMIT}
![Page 15: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/15.jpg)
Proposed Technique
1. Profile homogeneous architecture
2. Select frequencies and voltages
3. Modulo Scheduling
![Page 16: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/16.jpg)
Select Voltages and Frequencies
• For each domain
• At program level
• Consider different delays between fast and slow domains
  - Estimate execution time
  - Estimate energy consumption
  - Select minimum ED²
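The selection step can be sketched in Python as follows. This is an illustration only: it assumes each candidate voltage/frequency assignment already comes with an estimated execution time and energy, and the names `Candidate`, `ed2` and `select_min_ed2` as well as the numbers in the usage example are hypothetical, not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str         # label of a voltage/frequency assignment (hypothetical)
    exec_time: float  # estimated execution time, the delay D
    energy: float     # estimated energy E


def ed2(c):
    """Energy-delay-squared product: E * D^2."""
    return c.energy * c.exec_time ** 2


def select_min_ed2(candidates):
    """Return the candidate with the smallest E * D^2."""
    return min(candidates, key=ed2)


if __name__ == "__main__":
    # Made-up estimates, normalized to a homogeneous baseline.
    candidates = [
        Candidate("homogeneous", 1.00, 1.00),
        Candidate("hetero-a", 1.05, 0.80),
        Candidate("hetero-b", 1.20, 0.70),
    ]
    print(select_min_ed2(candidates).name)
```

Because delay enters squared, a configuration that saves energy but slows execution too much (like the made-up `hetero-b`) is rejected.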
![Page 17: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/17.jpg)
Estimate Execution Time
• Use profiling
  - T_exec = (n_iters + SC − 1) · II · T_cycle
• Estimated IT
  - Enough to accommodate
    - All instructions
    - All recurrences
    - Value lifetimes observed on the profiled homogeneous architecture
    - Communications required on the profiled homogeneous architecture
![Page 18: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/18.jpg)
Estimate Energy Consumption
• Same microarchitecture, different voltages
  - Relative dynamic power

$$\frac{P_{dyn1}}{P_{dyn2}} = \frac{p_t \cdot f_1 \cdot C_L \cdot V_{dd1}^2}{p_t \cdot f_2 \cdot C_L \cdot V_{dd2}^2} = \frac{f_1 \cdot V_{dd1}^2}{f_2 \cdot V_{dd2}^2}$$

  - Relative static power

$$\frac{P_{stat1}}{P_{stat2}} = \frac{V_{dd1} \cdot W_t \cdot \frac{I_0}{W_0} \cdot 10^{-V_{th1}/S}}{V_{dd2} \cdot W_t \cdot \frac{I_0}{W_0} \cdot 10^{-V_{th2}/S}} = 10^{\frac{V_{th2} - V_{th1}}{S}} \cdot \frac{V_{dd1}}{V_{dd2}}$$
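The two ratios on this slide reduce to P_dyn1/P_dyn2 = (f1 · Vdd1²) / (f2 · Vdd2²) and P_stat1/P_stat2 = (Vdd1 / Vdd2) · 10^((Vth2 − Vth1)/S). A minimal Python sketch that evaluates them is given below; the function names and the numeric values in the usage example are assumptions for illustration, not measurements from the talk.

```python
def relative_dynamic_power(f1, vdd1, f2, vdd2):
    """P_dyn1 / P_dyn2 = (f1 * Vdd1^2) / (f2 * Vdd2^2).
    Activity and load capacitance cancel because the microarchitecture is the same."""
    return (f1 * vdd1 ** 2) / (f2 * vdd2 ** 2)


def relative_static_power(vdd1, vth1, vdd2, vth2, s):
    """P_stat1 / P_stat2 = (Vdd1 / Vdd2) * 10^((Vth2 - Vth1) / S).
    s is the parameter S from the slide, in the same unit as the threshold voltages."""
    return (vdd1 / vdd2) * 10 ** ((vth2 - vth1) / s)


if __name__ == "__main__":
    # Hypothetical power-oriented domain (1) versus performance-oriented domain (2).
    print(relative_dynamic_power(f1=0.5, vdd1=0.8, f2=1.0, vdd2=1.0))
    print(relative_static_power(vdd1=0.8, vth1=0.3, vdd2=1.0, vth2=0.3, s=0.1))
```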
![Page 19: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/19.jpg)
MS Algorithm
[Flowchart: Compute MIT → IT := MIT → Select IIs & freqs → OK? If NO: increase IT and select again.]

| Cluster | Supported cycle times |
|---|---|
| C0 | 1 ns |
| C1 | 5/4 ns ; 4/3 ns ; 3/2 ns |

IT = 7 ns
  II(C0) = 7 / 1 = 7 cycles
  II(C1) = 7 / 1.25 = 5.6
  II(C1) = 7 / 1.3333 = 5.25
  II(C1) = 7 / 1.5 = 4.667
  No supported cycle time gives C1 a whole number of cycles, so IT is increased.

IT = 8 ns
  II(C0) = 8 / 1 = 8 cycles
  II(C1) = 8 / 1.25 = 6.4
  II(C1) = 8 / 1.3333 = 6
  C1 can run with a 4/3 ns cycle time and II = 6 cycles.
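The worked example above can be reproduced with a short Python sketch. It assumes that the "OK?" check only requires every cluster to have a supported cycle time that divides IT into a whole number of cycles, that IT grows in 1 ns steps, and that the slowest fitting clock is preferred; these are simplifying assumptions, and the function names are hypothetical. Exact fractions avoid rounding problems with 4/3 ns.

```python
from fractions import Fraction


def select_ii(it_ns, supported_cycle_times):
    """Return {cluster: (cycle_time, II)} if every cluster has a supported
    cycle time that divides IT into a whole number of cycles, else None."""
    choice = {}
    for cluster, cycle_times in supported_cycle_times.items():
        # Assumption: try the slowest clock first.
        for t in sorted(cycle_times, reverse=True):
            ii = Fraction(it_ns) / t
            if ii.denominator == 1:
                choice[cluster] = (t, int(ii))
                break
        else:
            return None  # no supported cycle time fits this IT
    return choice


def smallest_feasible_it(mit_ns, supported_cycle_times, max_it_ns=1000):
    """Start at MIT and increase IT until a per-cluster II selection exists."""
    for it in range(mit_ns, max_it_ns + 1):
        choice = select_ii(it, supported_cycle_times)
        if choice is not None:
            return it, choice
    return None


if __name__ == "__main__":
    supported = {
        "C0": [Fraction(1)],
        "C1": [Fraction(5, 4), Fraction(4, 3), Fraction(3, 2)],
    }
    print(select_ii(7, supported))             # None: C1 has no integer II
    print(smallest_feasible_it(7, supported))  # IT = 8 ns, C1 at 4/3 ns with II = 6
```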
![Page 20: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/20.jpg)
MS Algorithm
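[Flowchart: Compute MIT → IT := MIT → Select IIs & freqs → Partition DDG → Schedule → Done. Each step is followed by an "OK?" check; on NO the flow goes to "Increase IT" and repeats, on YES it moves to the next step.]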
![Page 21: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/21.jpg)
MS for Heterogeneous
• Multilevel graph partitioning algorithm
  - Coarsening
    - Place critical recurrences in fast clusters
  - Refinement
    - Estimate execution time and energy consumption
    - Optimize for ED²
• Scheduling
  - Instructions delayed due to synchronization hazards
![Page 22: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/22.jpg)
Recurrence Constrained Loops
• Large benefits expected
  - A small number of instructions are critical for execution time
• 2-cycle latency instructions
• 4 clusters, 1 FU per cluster
• Homogeneous
  - Recurrence: 6 cycles
  - II = 6 cycles (all clusters!): lots of unused slots
• Heterogeneous
  - 1 fast cluster, cycle time = 1 ns
  - 3 slow clusters, cycle time = 2 ns
  - IT = 6 ns
  - II(fast cluster) = 6
  - II(slow clusters) = 3
[Figure: data dependence graph with nodes A to K]
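The slot arithmetic behind this example can be checked with a short Python sketch. It assumes the homogeneous clusters run with a 1 ns cycle time, so that II = 6 cycles corresponds to IT = 6 ns; the function names are hypothetical.

```python
from fractions import Fraction


def ii_cycles(it_ns, cycle_time_ns):
    """II of a cluster, in its own cycles: IT divided by the cluster cycle time."""
    ii = Fraction(it_ns) / Fraction(cycle_time_ns)
    if ii.denominator != 1:
        raise ValueError("IT must be a whole number of cluster cycles")
    return int(ii)


def issue_slots(it_ns, cycle_times_ns, fus_per_cluster=1):
    """Issue slots available per iteration, summed over all clusters."""
    return sum(ii_cycles(it_ns, t) * fus_per_cluster for t in cycle_times_ns)


if __name__ == "__main__":
    print(issue_slots(6, [1, 1, 1, 1]))  # homogeneous: 4 clusters x 6 cycles = 24
    print(issue_slots(6, [1, 2, 2, 2]))  # heterogeneous: 6 + 3 + 3 + 3 = 15
```

The same iteration time is kept in both configurations, but the slow clusters offer fewer slots per iteration, so fewer of them are left unused.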
![Page 23: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/23.jpg)
Talk Outline
• Heterogeneous Clustered Architecture
• Proposed Compiler Techniques
• Experimental Evaluation
• Conclusions
![Page 24: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/24.jpg)
Experimental Environment
• Microarchitecture
  - Clusters
    - 4 clusters
    - 1 FP unit, 1 INT unit, 1 memory port, 16 registers per cluster
  - Inter-connection network
    - 1-cycle latency broadcast buses
  - Heterogeneity
    - Clusters: 1 performance oriented / 3 power oriented
• Benchmarks
  - SpecFP2k Fortran programs
  - Loops obtained with ORC
• Baseline: homogeneous architecture
![Page 25: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/25.jpg)
Results
[Figure: bar chart of ED² for the 1-bus and 2-bus configurations (y-axis ticks from 0.5 to 1) over wupwise, swim, mgrid, applu, galgel, facerec, lucas, fma3d, sixtrack, apsi and the mean. Callouts on the chart: 35%, 30%, 17%.]
![Page 26: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/26.jpg)
High ILP Programs
• All instructions have a similar impact on execution time
  - No benefit using different frequencies
• Higher IPC
  - Dynamic energy accounts for a majority of the total energy consumption
  - Use lower Vdd and lower frequencies
  - Memory hierarchy and inter-connection network can use different voltages
[Figure: data dependence graph with nodes A to I]
![Page 27: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/27.jpg)
Talk Outline
• Heterogeneous Clustered Architecture
• Proposed Compiler Techniques
• Experimental Evaluation
• Conclusions
![Page 28: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/28.jpg)
Conclusions
• Heterogeneous clustered architectures
  - Clusters run at different voltages / frequencies
  - Instructions that impact execution time scheduled in fast clusters
  - Remaining instructions in power-oriented clusters
• Proposed compiler techniques
  - Algorithm to select the voltages / frequencies
  - MS for heterogeneous configurations
• ED²: 15% improvement on average
  - Up to 35% for selected programs
![Page 30: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/30.jpg)
Motivating Example
With cycle times C1 = 1 ns and C2 = 1 ns:

| Cycle | C1 | Bus | C2 |
|---|---|---|---|
| 0 | A | | |
| 1 | B | Comm A | |
| 2 | C | | E |
| 3 | D | | |

With cycle times C1 = 1 ns and C2 = 2 ns (E occupies cycles 2 and 3):

| Cycle | C1 | Bus | C2 |
|---|---|---|---|
| 0 | A | | |
| 1 | B | Comm A | |
| 2 | C | | E |
| 3 | D | | E |

[Figure: data dependence graph with nodes A, B, C, D and E]
![Page 31: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/31.jpg)
Heterogeneous Architecture
• Signals
[Figure: clock and enable signal distribution. Labels recoverable from the diagram: clock; Freq. Multiplier / Divider; enable_C1, enable_CN, enable_mem, enable_all; clusters C1 to CN; sync queues; bus; On Chip Memory.]
![Page 32: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/32.jpg)
Branch Instructions
• Branches decoupled into several instructions (Unbundled Branch Architecture)
  - Branch target computation
    - Independent in each cluster
  - Branch condition evaluation
    - Computed in one cluster
    - Broadcast to the rest
  - Control transfer
    - Different in each cluster
![Page 33: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/33.jpg)
Results
[Figure: two bar charts of ED², one for 1 bus and one for 2 buses (y-axis ticks from 0.5 to 1), over wupwise, swim, mgrid, applu, galgel, facerec, lucas, fma3d, sixtrack, apsi and the mean.]
![Page 34: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/34.jpg)
Recurrence Constrained Loops
• Largest benefits obtained
  - A small number of instructions are critical for execution time
  - Improvement: 30% - 35% for 189.lucas and 200.sixtrack
• 2-cycle latency instructions
• 4 clusters, 1 FU per cluster
• Homogeneous
  - Recurrence: 6 cycles
  - II = 6 cycles (all clusters!): lots of unused slots
• Heterogeneous
  - 1 fast cluster, cycle time = 1 ns
  - 3 slow clusters, cycle time = 2 ns
  - IT = 6 ns
  - II(fast cluster) = 6
  - II(slow clusters) = 3
[Figure: data dependence graph with nodes A to K]
![Page 35: Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L M I J A B C C0CC00C0 C1 CC11C1 C2 CC22C2 ICN ICN Cycle time 1111 1 111 1 111 1 111](https://reader036.vdocuments.co/reader036/viewer/2022071219/6055d2c24b8c8828075709c0/html5/thumbnails/35.jpg)
Smallest Benefits
• Around 5%
  - 168.wupwise
    - Loops have different characteristics
  - 173.applu
    - Low number of iterations
    - Schedule length has a big impact