valladolid final-septiembre-2010

“Evolución de la Arquitectura de Computadores” — Valladolid, Septiembre 2010
Prof. Mateo Valero, Director

TRANSCRIPT

Page 1: Valladolid final-septiembre-2010

Valladolid, Septiembre 2010

“Evolución de la Arquitectura de Computadores”

Valladolid, Septiembre 2010

Prof. Mateo Valero Director

Page 2: Valladolid final-septiembre-2010

Technological Achievements

● Transistor (Bell Labs, 1947)
● DEC PDP-1 (1959)
● IBM 7090 (1960)
● Integrated circuit (1958)
● IBM System 360 (1965)
● DEC PDP-8 (1965)
● Microprocessor (1971): Intel 4004

Page 3: Valladolid final-septiembre-2010


Pipeline (H. Ford)

Page 4: Valladolid final-septiembre-2010


Technology Trends


Page 5: Valladolid final-septiembre-2010


Page 6: Valladolid final-septiembre-2010


Page 7: Valladolid final-septiembre-2010

Power Density

[Figure: power density in watts/cm² on a log scale from 1 to 1000, rising from the i386 and i486 through the Pentium, Pentium Pro, Pentium II, Pentium III and Pentium 4 toward the levels of a hot plate, a nuclear reactor and a rocket nozzle.]

* “New Microarchitecture Challenges in the Coming Generations of CMOS Process Technologies” – Fred Pollack, Intel Corp., Micro32 conference keynote, 1999.

Page 8: Valladolid final-septiembre-2010


Page 9: Valladolid final-septiembre-2010

Technology Outlook

High Volume Manufacturing    2004   2006   2008   2010   2012   2014   2016   2018
Technology Node (nm)         90     65     45     32     22     16     11     8
Integration Capacity (BT)    2      4      8      16     32     64     128    256
Delay = CV/I scaling         0.7    ~0.7   >0.7   Delay scaling will slow down
Energy/Logic Op scaling      >0.35  >0.5   >0.5   Energy scaling will slow down
Bulk Planar CMOS             High Probability            Low Probability
Alternate, 3G etc            Low Probability             High Probability
Variability                  Medium         High         Very High
ILD (K)                      ~3     <3     Reduce slowly towards 2-2.5
RC Delay                     1      1      1      1      1      1      1      1
Metal Layers                 6-7    7-8    8-9    0.5 to 1 layer per generation

Shekhar Borkar, Micro37, P
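The “Delay = CV/I” row is a standard first-order gate-delay model; the following short derivation is my reading of how the per-generation scaling factors in the table compound, not something stated on the slide:

```latex
% First-order gate delay behind the "Delay = CV/I" row: load capacitance
% times supply voltage over drive current. A per-generation delay factor s
% (historically ~0.7) compounds, so frequency could grow by 1/s per node --
% which is exactly what slows down once s drifts above 0.7.
\tau_{\mathrm{gate}} \propto \frac{C \, V_{dd}}{I_{on}}, \qquad
\tau_n \approx s^{\,n}\,\tau_0, \qquad
f_n \approx s^{-n} f_0 \quad (s = 0.7 \;\Rightarrow\; f_{n+1} \approx 1.43\, f_n)
```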

Page 10: Valladolid final-septiembre-2010

Lower Voltage, Increase Clock Rate & Transistor Density

We have seen an increasing number of gates on a chip and increasing clock speeds.

Heat has become an unmanageable problem; Intel processors already exceed 100 watts.

We will not see dramatic increases in clock speed in the future. However, the number of gates on a chip will continue to increase.

The old approach — packing ever more gates into a single tight knot and shrinking the processor's cycle time — gives way to replicating cores, as the diagram below shows (see also the power sketch after it).

[Diagram: from a single core with its cache, to two cores sharing a cache, to clusters of four cores (C1–C4) per cache, replicated across the chip.]
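A minimal sketch (not from the slides) of the dynamic-power relation behind this shift; every numeric value below is an illustrative assumption, not a figure from the talk.

```python
# Illustrative sketch of the dynamic-power relation P ~ alpha * C * V^2 * f
# that motivates moving from higher clocks to more cores.
# All numbers below are made-up, round figures for illustration only.

def dynamic_power(alpha, c_farads, v_volts, f_hz):
    """Switching power of a CMOS chip: activity * capacitance * V^2 * frequency."""
    return alpha * c_farads * v_volts ** 2 * f_hz

# One fast core: push frequency (and the voltage needed to sustain it).
single_core = dynamic_power(alpha=0.2, c_farads=30e-9, v_volts=1.3, f_hz=4.0e9)

# Two slower cores at lower voltage: similar aggregate throughput in the ideal
# case, but noticeably less power, because power falls with V^2 * f per core.
dual_core = 2 * dynamic_power(alpha=0.2, c_farads=30e-9, v_volts=1.0, f_hz=2.0e9)

print(f"one 4 GHz core : {single_core:5.1f} W")
print(f"two 2 GHz cores: {dual_core:5.1f} W")
```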

Page 11: Valladolid final-septiembre-2010

Increasing chip performance: Intel's Petaflop chip

ICPP-2009, September 23rd, 2009

● 80 processors in a die of 300 square mm
● Terabytes per second of memory bandwidth
● Note: the teraflop barrier was first reached by Intel in 1996, using almost 10,000 Pentium Pro processors housed in more than 85 cabinets occupying 200 square meters
● This will be possible within 3 years from now

Thanks to Intel
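A back-of-the-envelope reading of those numbers (my arithmetic, not the slide's): what each of the 80 cores has to sustain for the die to deliver a teraflop, and, purely hypothetically, a petaflop.

```python
# Per-core throughput implied by packing the whole machine into one 80-core die.
TERA, PETA = 1e12, 1e15
cores = 80

per_core_tera = TERA / cores   # flop/s per core for a 1 TF/s die
per_core_peta = PETA / cores   # flop/s per core for a hypothetical 1 PF/s die

print(f"1 TF/s on {cores} cores -> {per_core_tera / 1e9:.1f} GF/s per core")
print(f"1 PF/s on {cores} cores -> {per_core_peta / 1e12:.2f} TF/s per core")
```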

Page 12: Valladolid final-septiembre-2010

NVIDIA Fermi Architecture

● Unified 768 KB L2 cache serves all threads
● GigaThread hardware scheduler assigns Thread Blocks to SMs
● Wide DRAM interface provides 12 GB/s bandwidth
● 16 Streaming Multiprocessors (512 cores) execute Thread Blocks
● 620 Gigaflops
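For context, the usual peak-throughput estimate is cores × clock × flops per cycle. The sketch below uses that formula with an assumed 1.5 GHz shader clock and one fused multiply-add per core per cycle; the slide's 620 Gigaflops figure is presumably a different (e.g., sustained double-precision) number.

```python
# Rough peak-throughput estimate for a Fermi-class GPU (a sketch; the clock
# and flops-per-cycle values are assumptions, not the slide's figures).
cuda_cores = 512
shader_clock_hz = 1.5e9
flops_per_cycle_sp = 2            # one fused multiply-add per core per cycle

peak_sp = cuda_cores * shader_clock_hz * flops_per_cycle_sp
peak_dp = peak_sp / 2             # Fermi runs double precision at half rate

print(f"single-precision peak ~ {peak_sp / 1e9:.0f} GF/s")
print(f"double-precision peak ~ {peak_dp / 1e9:.0f} GF/s")
```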

Page 13: Valladolid final-septiembre-2010

Cell Broadband Engine™: A Heterogeneous Multi-core Architecture

* Cell Broadband Engine is a trademark of Sony Computer Entertainment, Inc.

Page 14: Valladolid final-septiembre-2010

Intel/UPC

● Since 2002 (Roger Espasa, Toni Juan)
● 40 people
● Microprocessor development (Larrabee x86 many-core)

Page 15: Valladolid final-septiembre-2010


Top10

Page 16: Valladolid final-septiembre-2010

Looking at the Gordon Bell Prize

● 1 GFlop/s; 1988; Cray Y-MP; 8 processors
  ● Static finite element analysis
● 1 TFlop/s; 1998; Cray T3E; 1,024 processors
  ● Modeling of metallic magnet atoms, using a variation of the locally self-consistent multiple scattering method
● 1 PFlop/s; 2008; Cray XT5; 1.5×10^5 processors
  ● Superconductive materials
● 1 EFlop/s; ~2018; ?; 1×10^7 processors (10^9 threads)

Jack Dongarra
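Those milestones imply roughly a 1000× gain per decade; a quick arithmetic check of the equivalent annual growth factor (my calculation, not part of the talk; the 2018 point is the slide's projection):

```python
# Annual growth factor implied by the Gordon Bell milestones on the slide.
milestones = [(1988, 1e9), (1998, 1e12), (2008, 1e15), (2018, 1e18)]  # (year, flop/s)

for (y0, f0), (y1, f1) in zip(milestones, milestones[1:]):
    annual = (f1 / f0) ** (1 / (y1 - y0))
    print(f"{y0} -> {y1}: {f1 / f0:,.0f}x overall, ~{annual:.2f}x per year")
```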

Page 17: Valladolid final-septiembre-2010

BSC-CNS and international initiatives: IESP

Build an international plan for developing the next-generation open-source software for scientific high-performance computing.

Improve the world’s simulation and modeling capability by improving the coordination and development of the HPC software environment

Page 18: Valladolid final-septiembre-2010

1 EFlop/s “Clean Sheet of Paper” Strawman

• 4 FPUs + register files per core (= 6 GF/s @ 1.5 GHz)
• 1 chip = 742 cores (= 4.5 TF/s); 213 MB of L1 I&D, 93 MB of L2
• 1 node = 1 processor chip + 16 DRAMs (16 GB)
• 1 group = 12 nodes + 12 routers (= 54 TF/s)
• 1 rack = 32 groups (= 1.7 PF/s); 384 nodes per rack; 3.6 EB of disk storage included
• 1 system = 583 racks (= 1 EF/s)
• 166 MILLION cores, 680 MILLION FPUs
• 3.6 PB of memory = 0.0036 bytes/flop
• 68 MW with aggressive assumptions

Sizing done by “balancing” power budgets with achievable capabilities
Largely due to Bill Dally

Courtesy of Peter Kogge, UND
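The strawman's roll-up is simple arithmetic; spelling it out (using only the figures on the slide):

```python
# Rolling up the strawman numbers: FPUs per core, cores per chip, nodes per
# group, groups per rack, racks per system, exactly as listed on the slide.
flops_per_core = 4 * 1.5e9          # 4 FPUs x 1.5 GHz = 6 GF/s
cores_per_chip = 742                # 1 node = 1 processor chip
nodes_per_group = 12
groups_per_rack = 32
racks = 583

chip = cores_per_chip * flops_per_core             # ~4.45 TF/s
group = nodes_per_group * chip                     # ~53 TF/s
rack = groups_per_rack * group                     # ~1.7 PF/s
system = racks * rack                              # ~1.0 EF/s
total_cores = racks * groups_per_rack * nodes_per_group * cores_per_chip

print(f"chip {chip/1e12:5.2f} TF/s, group {group/1e12:5.1f} TF/s")
print(f"rack {rack/1e15:5.2f} PF/s, system {system/1e18:5.2f} EF/s")
print(f"total cores ~ {total_cores/1e6:.0f} million")
```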

Page 19: Valladolid final-septiembre-2010

Education for Parallel Programming

[Cartoon: a “multicore-based pacifier” — multi-core programming, many-core programming, games, and massively parallel programming for everyone.]

Page 20: Valladolid final-septiembre-2010


Navigating the Mare Nostrum

Page 21: Valladolid final-septiembre-2010

Initial developments

Mechanical machines
1854: Boolean algebra by G. Boole
1904: Diode vacuum tube by J.A. Fleming
1938: Boolean algebra & electronic switches, C. Shannon
1946: ENIAC by J.P. Eckert and J. Mauchly
1945: Stored program by J. von Neumann ??????
1947: First transistor (Bell Labs)
1949: EDSAC by M. Wilkes
1952: UNIVAC I and IBM 701

Page 22: Valladolid final-septiembre-2010

In 50 Years ...

ENIAC, Eckert & Mauchly, 1946 ... 18,000 vacuum tubes
Pentium III playing a DVD, 1998 ... 24 M transistors

Page 23: Valladolid final-septiembre-2010

Technology Trends: Microprocessor Capacity

2× transistors per chip every 1.5 years — called “Moore's Law”.

Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.

Microprocessors have become smaller, denser, and more powerful. Not just processors: bandwidth, storage, etc.
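As arithmetic, the quoted rule compounds quickly; a small sketch of the growth factor it implies (illustrative only — actual products have tracked closer to a doubling every ~2 years):

```python
# Growth factor implied by "2x every 1.5 years" over different time spans.
def growth_factor(years, months_per_doubling=18):
    return 2 ** (years * 12 / months_per_doubling)

for span in (1.5, 10, 20):
    print(f"{span:>4} years -> x{growth_factor(span):,.0f}")
```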

Page 24: Valladolid final-septiembre-2010


Page 25: Valladolid final-septiembre-2010

Computer Architecture Achievements

• 1951: Microprogramming (M. Wilkes)
• 1962: Virtual memory (Atlas, Manchester)
• 1964: Pipelining (CDC 6600, S. Cray, 10 Mflop/s)
• 1965: Cache memory (M. Wilkes)
• 1975: Vector processors (S. Cray)
• 1980: RISC architecture (IBM, Berkeley, Stanford)
• 1982: Multiprocessors with distributed memory
• 1990: Superscalar processors: PA-RISC (HP) and RS/6000 (IBM)
• 1991: Multiprocessors with distributed shared memory
• 1994: SMT (M. Nemirovsky, D. Tullsen, S. Eggers)
• 1994: Speculative multiprocessors (G. Sohi, Wisconsin)
• 1996: Value prediction (J.P. Shen and M. Lipasti, CMU)
• 2000: Multicore/manycore architectures

Page 26: Valladolid final-septiembre-2010


Page 27: Valladolid final-septiembre-2010


Virtual Worlds have huge potential beyond Games

Commerce & Advertising

Corporate

Education

First Responders

Government

Health

Military

Science

Community Facilitation

Social Change

Page 28: Valladolid final-septiembre-2010

Jaguar @ ORNL: 1.75 PF/s

● Cray XT5-HE system
● Over 37,000 six-core AMD Opteron processors running at 2.6 GHz; 224,162 cores
● Power: 6.95 MW
● 300 terabytes of memory
● 10 petabytes of disk space
● 240 gigabytes per second of disk bandwidth
● Cray's SeaStar2+ interconnect network

Jack Dongarra
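A quick sanity check of the quoted 1.75 PF/s against theoretical peak (my arithmetic; the 4 flops/cycle per Opteron core, i.e. 128-bit SSE with 2 adds + 2 multiplies, is an assumption):

```python
# Theoretical peak vs. the quoted sustained performance for Jaguar.
cores = 224_162
clock_hz = 2.6e9
flops_per_cycle = 4                 # assumed: 2 DP adds + 2 DP muls per cycle

peak = cores * clock_hz * flops_per_cycle
print(f"theoretical peak ~ {peak / 1e15:.2f} PF/s")
print(f"quoted 1.75 PF/s is ~{1.75e15 / peak:.0%} of that peak")
```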

Page 29: Valladolid final-septiembre-2010

MareIncognito: Project structure

● Applications — 4 relevant apps: Materials: SIESTA; Geophysics imaging: RTM; Comp. Mechanics: ALYA; Plasma: EUTERPE; general kernels
● Performance analysis tools — automatic analysis, coarse/fine grain prediction, sampling, clustering, integration with Peekperf
● Interconnect — contention, collectives, overlap of computation/communication, slimmed networks, direct versus indirect networks
● Processor and node — contribution to new Cell design, support for the programming model, for load balancing and for performance tools, issues for future processors
● Load balancing — coordinated scheduling: run time, process, job; power efficiency
● Programming models — StarSs: CellSs, SMPSs; OpenMP@Cell; OpenMP++; MPI + OpenMP/StarSs
● Models and prototype

Page 30: Valladolid final-septiembre-2010

BSC-CNS: backbone of supercomputing research in Spain

● Supercomputing and eScience
● 22 elite research groups
● More than 120 senior researchers
● More than 300 PhD students

Application scopes: “Earth Sciences”, “Astrophysics”, “Engineering”, “Physics”, “Life Sciences”

Compilers and tuning of application kernels
Programming models and performance tuning tools
Architectures and hardware technologies

Page 31: Valladolid final-septiembre-2010

High Performance Computing as key-enabler

[Chart, courtesy of AIRBUS France: available computational capacity (flop/s, from Giga through Tera, Peta and Exa to Zeta) and the number of overnight load-case runs (10^2 to 10^6), plotted from 1980 to 2030. Aerodynamic simulation progresses from RANS (low speed, high speed, HS design data set, unsteady RANS) through CFD-based loads & HQ, aero optimisation & CFD-CSM, full MDO and CFD-based noise simulation (LES) to real-time CFD-based in-flight simulation — roughly a 10^6 growth in the capability achieved during one night batch, driven by “smart” use of HPC power: algorithms, data mining, knowledge.]

Page 32: Valladolid final-septiembre-2010

ITER design

TOKAMAK (JET, Oxford)

Page 33: Valladolid final-septiembre-2010

Supercomputing, theory and experimentation

Courtesy of IBM

Page 34: Valladolid final-septiembre-2010

Weather, Climate and Earth Sciences: Roadmap

2009
Resolution: 80 km
Memory: ≈ 110 GB; Storage: ≈ 8 TB
NEC SX-9, 48 vector processors: ≈ 40-day run
FLOPS: 3×10^14

2015
Resolution: 20 km
Memory: ≈ 3.5 TB; Storage: ≈ 180 TB
High-resolution model with complete carbon-cycle model
Challenges: data visualization and post-processing, data discovery, archiving
FLOPS: 1×10^16

2020
Resolution: 1 km
Memory: ≈ 4 PB; Storage: ≈ 150 PB
Higher resolution with global cloud-resolving model
Challenges: data sharing, transfer, memory management, I/O management
FLOPS: 1×10^19
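A rough cross-check of those FLOPS targets (my arithmetic, not from the slide): a common rule of thumb is that refining horizontal resolution by a factor r raises cost by roughly r³ (two horizontal dimensions plus a shorter time step). The slide's quoted figures grow more slowly than that crude estimate, presumably reflecting model and algorithm changes at each step.

```python
# Compare a crude r**3 cost-scaling estimate against the FLOPS targets quoted
# on the slide. The cubic rule is an assumption, not part of the roadmap.
quoted = {80: 3e14, 20: 1e16, 1: 1e19}   # resolution (km) -> quoted FLOPS

res = sorted(quoted, reverse=True)
for r0, r1 in zip(res, res[1:]):
    ratio = r0 / r1                       # refinement factor
    naive = ratio ** 3                    # crude cost-growth estimate
    actual = quoted[r1] / quoted[r0]
    print(f"{r0:>2} km -> {r1:>2} km: ~{naive:,.0f}x by crude scaling, {actual:,.0f}x quoted")
```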

Page 35: Valladolid final-septiembre-2010

Education for Parallel Programming

[Cartoon, repeated from page 19: a “multicore-based pacifier” — multi-core programming, many-core programming, games, and massively parallel programming for everyone.]

Page 36: Valladolid final-septiembre-2010


Navigating the Mare Nostrum