Parallel Computing
TRANSCRIPT
-
8/4/2019 Computo paralelo
1/173
History of Parallel Computing
It begins in 1955 with Gene Amdahl in the United States, working at the IBM company.
What is parallel processing?
It is the division of work into small tasks: assigning many small tasks to multiple workers so that they work simultaneously.
Parallel processing is the use of multiple processors to execute different parts of the same program simultaneously.
Difficulties: coordinating, controlling, and monitoring the workers.
The main goals of parallel processing are:
- Solve larger problems faster.
- Reduce the execution time of computer programs.
- Increase the size of the computational problems that can be solved.
Today there are several lines of research within parallel computing, as well as a variety of machines, languages, and applications.
Advantages
The idea of parallel computing arises from the following needs:
  o perform operations much faster,
  o process large volumes of information, and
  o obtain results in the least possible time.
These needs are the "why" and, at the same time, the main goals of parallel computing.
Definition. Parallel computer:
A collection of processors interconnected in some way within a single cabinet, in order to exchange information.
Taxonomy (from the Greek taxis, "arrangement", and nomos, "norm" or "rule"): the science of classification.
The classification of computers should be based on their most salient characteristics, not on the detailed ones that appear in data sheets. Several taxonomies or classifications of computers exist, such as Skillicorn's, Shore's (6 types), Handler's, and the structural taxonomy of Hockney and Jesshope. The most important of these classifications is Flynn's.
Flynn's taxonomy: classifies computers (their architecture and operating systems) according to the ability of a system to process:
- one or more simultaneous flows of data (data streams);
- one or more simultaneous flows of instructions (instruction streams).
SISD: a single instruction stream operates on a single data stream (classic architecture, superscalar processors).
SIMD: a single instruction stream operates on multiple data streams (array processors).
MISD: multiple instruction streams operate on a single data stream (a class with no implementations; it arises only as a by-product of the classification).
MIMD: multiple instruction streams operate on multiple data streams (multiprocessors).
Paradigms of parallel computing. Paradigms of parallel software.
There are several methods of programming parallel computers. The two most common are:
- message passing and
- data parallelism.
Message passing: the user makes library calls specifically to share information between processors.
Data parallel: the partitioning of the data determines the parallelism.
Shared memory: multiple processors share a common memory space.
Remote memory operation: a set of processes in which one process can access the memory of another process without that process's participation.
Threads: a single process has multiple (and concurrent) execution paths.
Combined models: composed of two or more of the models mentioned above.
Note: these models are machine/architecture independent; any of the models can be implemented on any hardware given the support of an appropriate operating system. An effective implementation is one that comes closest to the hardware model and gives the user ease of programming.
Message Passing
The message passing model is defined as:
- a set of processes using only local memory;
- processes communicate by sending and receiving messages;
- data transfer requires cooperative operations to be performed by each process (a send operation must have a matching receive).
Programming with message passing is done by linking with and making calls to libraries which manage the data exchange between processors. Message passing libraries are available for most modern programming languages.
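The cooperative send/receive pairing described above can be modeled with Python's standard library; this is a sketch only (threads standing in for processes, queues standing in for a real message passing library such as MPI), where each "processor" touches only its own local variables and shares data exclusively through explicit sends and matching receives:

```python
import threading
import queue

channel = queue.Queue()          # one-directional message channel
result = queue.Queue()           # channel for sending the answer back

def worker():
    # Local memory only: the worker receives its input as a message.
    numbers = channel.get()      # blocking receive, matches the send below
    result.put(sum(numbers))     # send the partial result back

t = threading.Thread(target=worker)
t.start()
channel.put([1, 2, 3, 4])        # send: transfers the data to the worker
total = result.get()             # receive: blocks until the worker replies
t.join()
print(total)                     # -> 10
```

The blocking `get` is what makes the operations cooperative: a send with no matching receive would leave the data sitting in the channel.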
Data Parallel
The data parallel model is defined as:
- each process works on a different part of the same data structure;
- commonly a Single Program Multiple Data (SPMD) approach;
- data is distributed across processors;
- all message passing is done invisibly to the programmer;
- commonly built "on top of" one of the common message passing libraries.
Programming with the data parallel model is accomplished by writing a program with data parallel constructs and compiling it with a data parallel compiler.
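A minimal data-parallel sketch in Python (not a data parallel compiler; a thread pool stands in for the processors): the partition of the data determines the parallelism, every worker runs the same function on its own chunk, and no explicit communication appears in the user's code:

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(16))

# Partition: each worker gets a different part of the same data structure.
chunks = [data[i::4] for i in range(4)]

def work(chunk):
    # Every worker runs the same program on its own chunk (SPMD style).
    return sum(x * x for x in chunk)

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(work, chunks))   # data movement is implicit

total = sum(partials)
print(total)   # sum of squares of 0..15 -> 1240
```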
Classification of Parallelism.
- Temporal (pipeline).
  o A program executes sequentially and, at a certain point, the tasks are divided among several processing units. When execution finishes in each unit, sequential execution resumes.
- Spatial.
  o Spatial parallelism occurs when there are several processors and a process can run on each of them more or less independently. In the optimal case, the execution time is divided by the number of processors that are working.
- Independent.
  o This parallelism does not depend on the topology of the processor network, since the program does not adapt its structure to that of the network's connections; that is, no matter how the processors are connected, the parallel execution of the program proceeds.
Levels of parallelism:
Two qualities matter for parallel programming.
Granularity: the relative size of the unit of computation that executes in parallel. This unit can be a statement, a function, or an entire process.
Communication channel: the basic mechanism by which the independent units of the program exchange data and synchronize their activity.
Statement level.
  o The finest level of granularity.
  o Used in languages such as Power C and Power Fortran 77/90.
  o Shared variables are used within a single memory system.
Thread level.
  o A thread is an independent state of execution within the context of a larger program, that is:
    - a set of machine registers;
    - a call stack;
    - the ability to execute code.
  o A program can create several threads that execute in the same address space. The reasons for using threads are portability and performance. The resources of a single processor are shared.
Process level.
  o A process in UNIX consists of:
    - an address space;
    - a large number of process-state values;
    - a thread of execution.
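The thread level can be illustrated with Python's threading module (a sketch, not one of the languages named above): each thread is an independent execution path with its own stack, yet all of them run in the one address space of the process and see the same variables, so access to shared state must be synchronized.

```python
import threading

shared = []                       # lives in the single shared address space
lock = threading.Lock()

def task(tid):
    # Each thread: own stack and registers, same address space as siblings.
    with lock:                    # synchronize access to the shared list
        shared.append(tid)

threads = [threading.Thread(target=task, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared))             # -> [0, 1, 2, 3]
```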
  o The interprocess communication mechanism can be used to exchange data and to coordinate the activities of multiple asynchronous processes.
  o A process can create one or more processes; the process that creates another is called the parent process, and the one created is called the child process. The initial process is called the root.
Performance.
The performance of a parallel program is measured in three terms:
- Speedup
- Efficiency
- Cost
where:
- T = time in seconds
- T1 = time using one processing unit
- Tp = time using two or more processing units
- P = number of processing units
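The formulas themselves did not survive the slide-to-text conversion; in terms of the symbols just defined, the three metrics are standardly written as:

```latex
S_P = \frac{T_1}{T_P}
\qquad
E_P = \frac{S_P}{P} = \frac{T_1}{P \, T_P}
\qquad
C_P = P \, T_P
```

Speedup measures how much faster the parallel run is, efficiency measures how well the P units are used, and cost is the total processor-time consumed.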
Amdahl's Law.
Amdahl's law concerns the speedup obtained from using parallel processors on a problem, compared with using only one serial processor.
To understand speedup, let us first look at speed:
  o The speed of a program is the time it takes to be executed. This can be measured in any increment of time.
  o Speedup is defined as the time it takes a program to execute serially (with one processor) divided by the time it takes to execute in parallel (with several processors).
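The law's formula is not reproduced in the transcript; in its common form, if a fraction f of the program can be parallelized over p processors while the rest stays serial, the predicted speedup is S(p) = 1 / ((1 - f) + f/p). A quick sketch (the function name is ours):

```python
def amdahl_speedup(f, p):
    """Predicted speedup when a fraction f of the work is parallelizable
    and runs on p processors; the remaining (1 - f) stays serial."""
    return 1.0 / ((1.0 - f) + f / p)

# The serial fraction caps the speedup no matter how many processors:
print(amdahl_speedup(0.9, 10))       # -> ~5.26
print(amdahl_speedup(0.9, 10**9))    # approaches 1 / (1 - 0.9) = 10
```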
Parallel Architectures.
Old criteria:
- Parallel architectures tied to programming models.
- Divergent architectures, with no pattern of growth.
Parallel Architectures.
Current criteria:
- Extension of computer architecture to support communication and cooperation.
  o BEFORE: instruction sets.
  o NOW: communications.
- One must define:
  o abstractions, boundaries, primitives (interfaces);
  o structures that implement those interfaces (hw or sw).
- Compilers, libraries, and the OS are important concerns nowadays.
Parallel Architectures.
Recall that:
A parallel computer is a collection of processing elements that communicate and cooperate to solve large problems quickly.
We can say that Parallel Architecture is:
conventional architecture
+
communication architecture
Parallel Architectures.
Programming models
A programming model specifies the communications and the synchronization. Examples:
  o Multiprogramming: no communication or synchronization; program-level parallelism.
  o Shared memory: like a bulletin board.
  o Message passing: like letters or telephone calls, point to point.
  o Data parallelism: several agents act on individual data items and then exchange control information before continuing. The exchange is implemented with shared memory or with message passing.
Parallel Architectures.
Levels of abstraction in communication.
Parallel Architectures.
Evolution of the architectural models:
- Programming model, communication, and machine organization together make up the architecture.
- Shared memory space.
- Message passing.
- Data parallelism.
- Others:
  o data flow;
  o systolic arrays.
Shared Memory.
Any processor can directly reference any memory location.
  o Communication occurs implicitly through loads and stores.
Advantages:
  o Transparent placement.
  o Programming similar to time-sharing on uniprocessors, except that the processes run on different processors.
  o Good performance in load distribution.
Shared Memory.
Communication hardware:
Independent growth of memory, I/O, or processing capacity by adding modules, controllers, or processors.
Shared Memory.
Mainframe communication strategy:
- Crossbar network.
- Initially limited by the cost of the processors; later, by the cost of the network.
- Bandwidth grows with p.
- High cost of scaling; use of multistage networks.
Shared Memory.
Minicomputer communication strategy:
- Almost all microprocessor systems use a bus.
- Widely used for parallel computing.
- Called SMPs, symmetric multiprocessors.
- The bus can become a bottleneck.
- The cache coherence problem.
- Low cost of scaling.
Shared Memory.
Example: Intel Pentium Pro Quad. Coherence and multiprocessing integrated into the processor module.
Shared Memory.
Example: Sun Enterprise. 16 boards of any type: processors + memory, or I/O. Memory access is over the bus, symmetric.
Shared Memory.
Other communication options:
- Interconnection problems: cost (crossbars) or bandwidth (bus).
- Dance-hall: scalable at lower cost than crossbars.
  o Uniform memory access latency, but high.
- NUMA (non-uniform memory access):
  o construction of a single memory space with different latencies.
- COMA (Cache Only Memory Architecture):
  o memory architecture based on shared caches.
Shared Memory.
Example: Cray T3E. Scalable to 1024 processors, 480 MB/s links.
- The memory controller generates the requests for non-local locations.
- No hardware mechanism for coherence (the SGI Origin and others do provide one).
Message Passing.
- Built from complete computers, including I/O.
  o Communication through explicit I/O operations.
- Programming model: direct access only to private addresses (local memory); communication by means of messages (send/receive).
- Block diagram similar to NUMA's.
  o But communications are integrated at the I/O level.
  o Like networks of workstations (clusters), but with tighter integration.
  o Easier to build and scale than NUMA systems.
- Programming model less integrated into the hardware.
  o Libraries or operating system intervention.
Message Passing.
- send specifies the buffer to transmit and the receiving process.
- recv specifies the sending process and the buffer to store into.
- These are memory-to-memory copies, but the process names are needed.
- Optionally, the send can carry a destination tag, with matching rules applied at the destination.
- In the simplest form, pairing is achieved through the synchronization of send/recv events.
  o Multiple synchronization variants exist.
- Large overheads: copying, buffer management, protection.
Message Passing.
Evolution of message passing machines
- Early machines: a FIFO on each link.
  o Programming model very close to the hardware; simple synchronization operations.
  o Replaced by DMA, enabling non-blocking operations; buffering at the destination until recv.
- Diminishing influence of topology (hardware routing).
  o Store-and-forward routing: topology matters.
  o Introduction of multistage networks.
  o Higher cost: node-to-network communication.
  o Simpler programming.
Message Passing.
Example: IBM SP-2. Built from RS6000 workstations.
Message Passing.
Example: Intel Paragon.
The convergence of architectures.
- The evolution and role of software have blurred the boundary between shared memory and message passing.
  o send/recv supports shared memory via buffers.
  o A global address space can be built on top of message passing.
- The hardware organization is also converging.
  o Tighter integration for message passing (lower latency, higher bandwidth).
  o At a low level, some shared-memory systems implement message passing in hardware.
- Different programming models, but also in convergence.
Interconnection of parallel systems
The mission of the network in a parallel architecture is to transfer information from any source to any destination, minimizing latency, at a proportionate cost.
The network consists of:
  o nodes;
  o switches;
  o links.
The network is characterized by its:
  o topology: the structure of the physical interconnection;
  o routing: which determines the routes that messages may or must follow in the network graph;
  o switching strategy: circuit switching or packet switching;
  o flow control: mechanisms for organizing the traffic.
Interconnection of parallel systems
Classification of networks by topology.
Static:
  o static, direct point-to-point connections between the nodes;
  o tight coupling between network interface and node;
  o the vertices of the network graph are nodes or switches.
They are further classified as:
- symmetric: ring, hypercube, torus;
- non-symmetric: bus, tree, mesh.
Interconnection of parallel systems
Classification of networks by topology.
Dynamic:
  o the switches can dynamically change which nodes they interconnect.
They are further classified as:
- single-stage;
- multistage:
  o blocking (baseline, butterfly, shuffle);
  o rearrangeable (Benes);
  o non-blocking (Clos).
Interconnection of parallel systems
Characteristic parameters of a network:
- Network size: the number of nodes that compose it.
- Node degree: the number of links incident on the node.
- Network diameter: the longest of the shortest paths that can be found between any two nodes of the network.
- Symmetry: a network is symmetric if all nodes are indistinguishable from the point of view of communication.
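The diameter definition can be checked mechanically: compute the shortest path between every pair of nodes and take the longest. A sketch for a ring, one of the symmetric static topologies listed above, whose diameter should be n // 2:

```python
from collections import deque

def diameter(adj):
    """Longest shortest path over all node pairs of an unweighted graph."""
    def eccentricity(start):
        dist = {start: 0}
        q = deque([start])
        while q:                          # breadth-first search from start
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(eccentricity(v) for v in adj)

# Ring of n nodes: node i is linked to its neighbors i-1 and i+1 (mod n).
n = 8
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
print(diameter(ring))   # -> 4, i.e. n // 2
```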
Static networks
Cycle-connected 3-D hypercube
Static networks
Example of connections in a 3-D hypercube:
- connection of nodes that differ in the least significant bit;
- connection of nodes that differ in the second bit;
- connection of nodes that differ in the most significant bit.
Dynamic networks
Dynamic networks are networks whose configuration can be modified.
There are two types:
  o single-stage;
  o multistage.
Single-stage networks make connections between processing elements in a single stage.
  o It may not be possible to reach every element from every other element, so it may be necessary to recirculate the information (=> recirculating networks).
Multistage networks make connections between the processing elements in more than one stage.
Dynamic networks
Single-stage interconnection networks
Dynamic networks
Crossbar network: allows any connection.
Dynamic networks
Multistage interconnection networks
Switch boxes
The four possible configurations of a 2-input switch box.
Blocking dynamic networks
Blocking multistage networks.
  o It is not always possible to establish a new connection between a free source/destination pair, owing to conflicts with the connections in progress.
  o Generally there is a single possible path between each source/destination pair.
Blocking dynamic networks
Baseline network:
Blocking dynamic networks
Butterfly network:
Blocking dynamic networks
Perfect shuffle network:
Rearrangeable dynamic networks
Rearrangeable multistage networks.
  o It is always possible to establish a new connection between a free source/destination pair, even while other connections are in progress, but it may become necessary to change the path used by some of them (rearrangement).
  o Attractive for array processors, where all the interconnection requests are known simultaneously.
Rearrangeable dynamic networks
Benes network:
Rearrangeable dynamic networks
The Benes network can be built recursively:
Non-blocking dynamic networks
Non-blocking dynamic networks.
  o It is always possible to establish a new connection between a free source/destination pair, without restrictions.
  o They are analogous to crossbar switches, but may exhibit higher latency because of the multiple stages.
Non-blocking dynamic networks
Clos network:
Cache Coherence
Common structures of the memory hierarchy in multiprocessors:
- Shared cache.
- Memory shared over a bus.
- Interconnection through a network (dance-hall).
- Distributed memory.
Cache Coherence
Shared cache
- Small number of processors (2-8).
- Common in the mid-80s for connecting a pair of processors on a board.
- A possible strategy for chip multiprocessors.
Cache Coherence
Sharing over a bus.
- Widely used in small- and medium-scale multiprocessors (20-30 processors).
- The dominant form among current parallel machines.
- Modern microprocessors are equipped to support coherence protocols in this configuration.
Cache Coherence
Dance hall
- Easily scalable.
- Symmetric UMA structure.
- Memory too far away, especially in large systems.
Cache Coherence
Distributed memory
- Especially attractive for scalable multiprocessors.
- Non-symmetric NUMA structure.
- Fast local accesses.
Parallel computer architectures.
Types of processor organization: there are 7 important methods of organizing processors to connect them in a parallel computer.
- Mesh networks.
- Binary tree networks.
- Hypertree networks.
- Pyramid networks.
- Butterfly networks.
- Hypercube networks.
- Cube-connected cycles networks.
The organization of processors is also known as the topology, and based on it there are criteria for evaluating which model is more suitable than another. The applications can also determine whether a model is suitable or not.
Criteria for evaluating the models:
- Diameter.
- Bisection width.
- Number of edges per node.
- Maximum edge length.
- Degree of an architecture.
Mesh Networks.
- Nodes are arranged into a q-dimensional lattice.
- Communication is allowed only between neighboring nodes.
- Two-dimensional meshes:
  o mesh with no wrap-around connections;
  o mesh with wrap-around connections between processors in the same row or column;
  o mesh with wrap-around connections between processors in adjacent rows or columns.
- Wrap-around connections can connect processors in the same row or column, or in adjacent rows or columns.
Evaluation of the mesh:
- Interior nodes communicate with 2q other processors.
- The diameter of a q-dimensional mesh with k^q nodes is q(k - 1).
- The bisection width of a q-dimensional mesh with k^q nodes is k^(q-1).
- The maximum number of edges per node is 2q.
- The maximum edge length is a constant, independent of the number of nodes, for two- and three-dimensional meshes.
The two-dimensional mesh has been a popular topology for processor arrays:
  o Goodyear Aerospace's MPP
  o AMT DAP
  o MasPar's MP
  o The Intel Paragon XP/S multicomputer connects processors with a two-dimensional mesh.
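The mesh figures above can be restated as tiny formulas (helper names are ours), which makes the growth rates easy to compare against the other topologies in this section:

```python
def mesh_diameter(q, k):
    # q-dimensional mesh, k nodes per side (k**q nodes), no wrap-around.
    return q * (k - 1)

def mesh_bisection_width(q, k):
    return k ** (q - 1)

def mesh_max_edges_per_node(q):
    return 2 * q

# A 2-D 16x16 mesh, such as a 256-node processor array:
print(mesh_diameter(2, 16))            # -> 30
print(mesh_bisection_width(2, 16))     # -> 16
print(mesh_max_edges_per_node(2))      # -> 4
```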
Hypertree Networks
An approach to building a network with the low diameter of a binary tree and an improved bisection width. The easiest way to think of a hypertree network of degree k and depth d is to consider the network from two different angles:
- From the front, a hypertree network of degree k and depth d looks like a complete k-ary tree of height d.
- From the side, the same hypertree network looks like an upside-down binary tree of height d.
Joining the front and side views yields the complete network.
Hypertree evaluation
A 4-ary hypertree with depth d has:
- 4^d leaves;
- 2^d (2^(d+1) - 1) nodes;
- diameter 2d;
- bisection width 2^(d+1);
- never more than six edges per node;
- a maximum edge length that is an increasing function of the problem size.
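Collecting those figures in one helper (the function name is ours) makes the superscripts unambiguous after the formatting loss above:

```python
def hypertree_4ary(d):
    """Metrics of a 4-ary hypertree of depth d, per the list above."""
    return {
        "leaves": 4 ** d,
        "nodes": 2 ** d * (2 ** (d + 1) - 1),
        "diameter": 2 * d,
        "bisection_width": 2 ** (d + 1),
    }

print(hypertree_4ary(2))
# depth 2: 16 leaves, 28 nodes, diameter 4, bisection width 8
```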
Hypertree network of degree 4 and depth 2: (a) front view, (b) side view, (c) complete network. The Connection Machine CM-5 multicomputer is a 4-ary hypertree.
Pyramid Networks
An attempt to combine the advantages of mesh networks and tree networks.
A pyramid network of size k^2 is a complete 4-ary rooted tree of height log2 k, augmented with additional interprocessor links so that the processors in every tree level form a 2-D mesh network.
A pyramid of size k^2 has at its base a 2-D mesh network containing k^2 processors. The total number of processors in a pyramid of size k^2 is (4/3)k^2 - (1/3).
The levels of the pyramid are numbered in ascending order: the base has level number 0, and the single processor at the apex of the pyramid has level number log2 k.
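The processor count follows from summing the mesh levels, k^2 + k^2/4 + ... + 1 = (4k^2 - 1)/3, which is the same as (4/3)k^2 - (1/3). A quick check (helper name ours):

```python
def pyramid_processors(k):
    # Sum the 2-D mesh levels: k*k at the base, then (k/2)*(k/2), ..., 1.
    total, side = 0, k
    while side >= 1:
        total += side * side
        side //= 2
    return total

for k in (4, 8, 16):
    assert pyramid_processors(k) == (4 * k * k - 1) // 3

print(pyramid_processors(4))   # -> 21 processors in a pyramid of size 16
```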
Pyramid network of size 16.
Every interior processor is connected to nine other processors: one parent, four mesh neighbors, and four children.
Pyramid evaluation
The advantage of the pyramid over the 2-D mesh is that the pyramid reduces the diameter of the network: when a message must travel from one side of the mesh to the other, fewer link traversals are required if the message travels up and down the tree rather than across the mesh.
- The diameter of a pyramid of size k^2 is 2 log k.
- The addition of tree links does not give a significantly higher bisection width than a 2-D mesh: the bisection width of a pyramid of size k^2 is 2k.
- The maximum number of links per node is no greater than nine, regardless of the size of the network.
- Unlike a 2-D mesh, the length of the longest edge is an increasing function of the network size.
Butterfly Network
- Consists of (k+1)2^k nodes divided into k+1 rows, or ranks, each containing n = 2^k nodes.
- The ranks are labeled 0 through k.
- Ranks 0 and k are sometimes combined, giving each node four connections to other nodes.
Node connection
Let node(i, j) refer to the jth node on the ith rank, where 0 <= i <= k and 0 <= j < 2^k. Node(i, j) on rank i > 0 is connected to node(i-1, j) and to node(i-1, m), where m is the integer that differs from j in the ith most significant bit.
Butterfly evaluation
- As the rank numbers decrease, the widths of the wings of the butterflies increase exponentially.
- The length of the longest network edge increases as the number of network nodes increases.
- The diameter of a butterfly network with (k + 1)2^k nodes is 2k.
- The bisection width is 2^(k-1).
A butterfly network serves to route data from non-local memory to processors on the BBN TC2000 multiprocessor.
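The 2k diameter can be checked by building the graph and running breadth-first search; this is a sketch under our own indexing convention (rank i links column j to columns j and j with the ith most significant bit inverted on rank i-1), which matches the node-connection rule stated earlier:

```python
from collections import deque

def butterfly(k):
    """Adjacency of a (k+1)*2**k-node butterfly; nodes are (rank, column)."""
    adj = {(i, j): set() for i in range(k + 1) for j in range(2 ** k)}
    for i in range(1, k + 1):
        for j in range(2 ** k):
            m = j ^ (1 << (k - i))        # invert the ith most significant bit
            for up in ((i - 1, j), (i - 1, m)):
                adj[(i, j)].add(up)
                adj[up].add((i, j))
    return adj

def diameter(adj):
    def eccentricity(s):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(eccentricity(v) for v in adj)

k = 3
net = butterfly(k)
print(len(net))        # -> (k + 1) * 2**k = 32 nodes
print(diameter(net))   # -> 2 * k = 6
```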
Hypercube
A cube-connected network, also called a binary n-cube network, is a butterfly with its columns collapsed into single nodes.
- Consists of 2^k nodes forming a k-dimensional hypercube.
- The nodes are labeled 0, 1, ..., 2^k - 1.
- Two nodes are adjacent if their labels differ in exactly one bit position.
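The adjacency rule is a one-liner: XOR the two labels and count the set bits (helper name ours):

```python
def hypercube_adjacent(u, v):
    """Hypercube nodes are neighbors iff their labels differ in
    exactly one bit position."""
    return bin(u ^ v).count("1") == 1

print(hypercube_adjacent(0b0101, 0b0100))   # -> True: differ in one bit
print(hypercube_adjacent(0b0101, 0b0110))   # -> False: differ in two bits

# The k neighbors of node 0 in a four-dimensional hypercube (2**4 nodes):
print([v for v in range(16) if hypercube_adjacent(0, v)])   # -> [1, 2, 4, 8]
```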
A four-dimensional hypercube.
Hypercube evaluation
- The diameter of a hypercube with 2^k nodes is k.
- The bisection width of that size network is 2^(k-1).
- The hypercube organization has low diameter and high bisection width, at the expense of the number of edges per node and the length of the longest edge.
- The number of edges per node is k, the logarithm of the number of nodes in the network.
- The length of the longest edge in a hypercube network increases as the number of nodes in the network increases.
Cube-Connected Cycles Networks
The cube-connected cycles network is a k-dimensional hypercube whose 2^k "vertices" are actually cycles of k nodes.
For each dimension, every cycle has a node connected to a node in the neighboring cycle in that dimension.
24-node cube-connected cycles network.
Cube-connected cycles evaluation
Node(i, j) is connected to node(i, m) if and only if m is the result of inverting the ith most significant bit of the binary representation of j.
Compared to the hypercube, the cube-connected cycles processor organization has the advantage that the number of edges per node is three - a constant independent of network size.
Disadvantages:
- the diameter is twice that of a hypercube;
- the bisection width is lower.
Given a cube-connected cycles network of size k2^k:
  o its diameter is 2k;
  o its bisection width is 2^(k-1).
CHARACTERISTICS OF VARIOUS PROCESSOR ORGANIZATIONS
Data flow and implicit parallelism
Von Neumann vs. parallel
Parallel random access machine (PRAM):
- Provides a mental break from the von Neumann model and sequential algorithms.
- The PRAM (pronounced "pea ram") model of parallel computation allows parallel-algorithm designers to treat processing power as an unlimited resource.
- Unrealistically simple: it ignores the complexity of interprocessor communication.
- The designer of PRAM algorithms can focus on the parallelism inherent in a particular computation.
- Cost-optimal PRAM solutions exist, meaning that the total number of operations performed by the PRAM algorithm is of the same complexity class as an optimal sequential algorithm.
- Cost-optimal PRAM algorithms can serve as a foundation for efficient algorithms on real parallel computers.
RAM and PRAM computational models
RAM, a model of serial computation
The random access machine (RAM) is a model of a one-address computer. It consists of:
- memory;
- a read-only input tape;
- a write-only output tape;
- a program.
RAM program
- The program is not stored in memory and cannot be modified.
- The input tape contains a sequence of integers. Every time an input value is read, the input head advances one square.
- The output head advances after every write.
- Memory consists of an unbounded sequence of registers r0, r1, r2, .... Each register can hold a single integer. Register r0 is the accumulator, where computations are performed.
- The exact instructions are not important, as long as they resemble the instructions found on an actual computer: load, store, read, write, add, subtract, multiply, divide, test, jump, and halt.
RAM time complexity
- The worst-case time complexity of a RAM program is the function f(n) giving the maximum time taken by the program to execute over all inputs of size n.
- The expected time complexity of a RAM program is the average, over all inputs of size n, of the execution times.
- Analogous definitions hold for worst-case space complexity and expected space complexity.
- There are two ways of measuring time and space on the RAM model:
  o the uniform cost criterion;
  o the logarithmic cost criterion.
Cost criteria
- The uniform cost criterion says that each RAM instruction requires one time unit to execute and every register requires one unit of space.
- The logarithmic cost criterion takes into account that an actual word of memory has a limited storage capacity.
- The uniform cost criterion is appropriate if the values manipulated by the program always fit into one computer word.
The PRAM model of parallel computation
A PRAM consists of:
- a control unit;
- global memory;
- an unbounded set of processors, each with its own private memory.
Active processors execute identical instructions. Every processor has a unique index, and the value of a processor's index can be used to enable or disable the processor, or to influence which memory location it accesses.
PRAM computation
A PRAM computation begins with:
- the input stored in global memory;
- a single active processing element.
During each step, an active, enabled processor can:
- read a value from a single private or global memory location;
- perform a single RAM operation;
- write into one local or global memory location;
- activate another processor.
The processors are synchronized: all active, enabled processors must execute the same instruction, on different memory locations.
The computation terminates when the last processor halts.
PRAM models differ in how they handle read or write conflicts, i.e., when two or more processors attempt to read from, or write to, the same global memory location.
1. EREW (Exclusive Read, Exclusive Write): read and write conflicts are not allowed.
2. CREW (Concurrent Read, Exclusive Write): concurrent reading is allowed, i.e., multiple processors may read from the same global memory location during the same instruction step. Write conflicts are not allowed. (This is the default PRAM model.)
3. CRCW (Concurrent Read, Concurrent Write): concurrent reading and concurrent writing are allowed. A variety of CRCW models exist, with different policies for handling concurrent writes to the same global address.
Types of CRCW PRAM
Three different models: Common, Arbitrary, and Priority.
1. Common. All processors concurrently writing into the same global address must be writing the same value.
2. Arbitrary. If multiple processors concurrently write to the same global address, one of the competing processors is arbitrarily chosen as the "winner," and its value is written into the register.
3. Priority. If multiple processors concurrently write to the same global address, the processor with the lowest index succeeds in writing its value into the memory location.
Strengths of PRAM models
The EREW PRAM model is the weakest. Clearly, a CREW PRAM can execute any EREW PRAM algorithm in the same amount of time; the concurrent read facility is simply not used.
A CRCW PRAM can execute any CREW PRAM algorithm in the same amount of time.
The PRIORITY PRAM model is the strongest.
Any algorithm designed for the COMMON PRAM model will execute with the same complexity on the ARBITRARY PRAM and PRIORITY PRAM models.
If all processors writing to the same location write the same value, choosing an arbitrary processor would cause the same result.
If an algorithm executes correctly when an arbitrary processor is chosen as the "winner," the processor with the lowest index is as reasonable an alternative as any other.
Any algorithm designed for the ARBITRARY PRAM model will execute with the same time complexity on the PRIORITY PRAM model.
Because the PRIORITY PRAM model is stronger than the EREW PRAM model, an algorithm to solve a problem on the EREW PRAM can have higher time complexity than an algorithm solving the same problem on the PRIORITY PRAM model.
Increase in parallel time complexity
The increase in parallel time complexity can occur when moving from the PRIORITY PRAM model to the EREW PRAM model.
Lemma. A p-processor EREW PRAM can sort a p-element array stored in global memory in Θ(log p) time.
Theorem. A p-processor PRIORITY PRAM can be simulated by a p-processor EREW PRAM with the time complexity increased by a factor of Θ(log p).
Simulation PRIORITY PRAM by EREW PRAM
Assume the PRIORITY PRAM algorithm uses processors P1, P2, ..., Pp and global memory locations M1, M2, ..., Mm.
The EREW PRAM uses auxiliary global memory locations T1, T2, ..., Tp and S1, S2, ..., Sp to simulate each read or write step of the PRIORITY PRAM.
When processor Pi in the PRIORITY PRAM algorithm accesses memory location Mj, processor Pi in the EREW PRAM algorithm writes the ordered pair (j, i) into memory location Ti.
Then the EREW PRAM sorts the elements of T. This step takes Θ(log p) time (Lemma). By reading adjacent entries in the sorted array, the highest-priority processor accessing any particular location can be found in constant time.
Processor P1 reads memory location T1, retrieves the ordered pair (j1, i1), and writes a 1 into global memory location Si1.
The remaining processors Pk, where 2 ≤ k ≤ p, read both Tk-1 and Tk; processor Pk writes a 1 into Sik if the address in Tk differs from the address in Tk-1, and a 0 otherwise.
A concurrent write operation
A concurrent write operation, which takes constant time on a p-processor PRIORITY PRAM, can be simulated in Θ(log p) time on a p-processor EREW PRAM.
(a) Concurrent write on the PRIORITY PRAM model.
Processors P1, P2, and P4 attempt to write values to memory location M3. Processor P1 wins.
Processors P3 and P5 attempt to write values to memory location M7. Processor P3 wins.
Concurrent write on the EREW PRAM model
Each processor writes an (address, processor number) pair to a unique element of T.
The processors sort T in Θ(log p) time.
In constant time, processors can set to 1 those elements of S corresponding to the winning processors.
Winning processors write their values.
For a write instruction, the highest-priority processor accessing each memory location writes its value.
For a read instruction, the highest-priority processor accessing each memory location reads that location's value, then duplicates the value in Θ(log p) time so that there is a copy in a unique memory location for every processor to access.
PRAM ALGORITHMS
If a PRAM algorithm has lower time complexity than an optimal RAM algorithm, it is because parallelism has been used.
PRAM algorithms begin with only a single active processor and have two phases:
In the first phase, a sufficient number of processors are activated;
these activated processors then perform the computation in parallel.
Given a single active processor, it is easy to see that ⌈log p⌉ activation steps are both necessary and sufficient for p processors to become active.
Processor activation
Exactly ⌈log p⌉ processor activation steps are necessary and sufficient to go from 1 active processor to p active processors.
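The doubling argument can be checked with a few lines of Python (an illustrative sketch; `activation_steps` is a name invented here):

```python
from math import ceil, log2

def activation_steps(p):
    """Count activation steps when every active processor may
    activate one more processor per step, so the count doubles."""
    active, steps = 1, 0
    while active < p:
        active = min(2 * active, p)  # each active processor spawns one more
        steps += 1
    return steps
```

The loop count always equals ⌈log2 p⌉: for example, 3 steps suffice for p = 8 (1 → 2 → 4 → 8) and also for p = 5 (1 → 2 → 4 → 5).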
The Binary Tree Model
The binary tree is one of the most important paradigms of parallel computing.
In some algorithms, data flows top-down from the root of the tree to the leaves. Broadcast and divide-and-conquer algorithms both fit this model.
In broadcast algorithms, the root sends the same data to every leaf.
In divide-and-conquer algorithms, the tree represents the recursive subdivision of problems into subproblems.
In other algorithms, data flows bottom-up from the leaves of the tree to the root. These are called fan-in or reduction operations.
Parallel Reduction
Given:
a set of n values a1, a2, ..., an,
an associative binary operator ⊕,
reduction is the process of computing a1 ⊕ a2 ⊕ ... ⊕ an.
Parallel summation is an example of a reduction operation.
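The tree-shaped computation can be sketched in Python. This sequential sketch only mirrors the data flow; on a PRAM each round's pairwise combinations would run simultaneously (`parallel_reduce` is a name invented here):

```python
def parallel_reduce(values, op):
    """Reduce `values` with the associative operator `op` in tree order.

    Each round combines adjacent pairs, halving the number of partial
    results, so ceil(log2 n) rounds suffice.
    """
    a = list(values)
    rounds = 0
    while len(a) > 1:
        # on a PRAM, all pairs in this round are combined at the same time
        a = [op(a[k], a[k + 1]) if k + 1 < len(a) else a[k]
             for k in range(0, len(a), 2)]
        rounds += 1
    return a[0], rounds
```

Summing the values 1 through 8 gives 36 after 3 rounds (8 → 4 → 2 → 1 partial results).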
We represent each tree node with an element in an array. The mapping from the tree to the array is straightforward.
Sum of n values
Complexity:
The spawn routine requires ⌈log(n/2)⌉ doubling steps.
The sequential for loop executes ⌈log n⌉ times, and each iteration has constant time complexity.
Hence the overall time complexity of the algorithm is Θ(log n), given n/2 processors.
Processor Organizations
A processor organization can be represented by a graph:
nodes (vertices) represent processors,
edges represent communication paths between pairs of processors.
Processor organizations are evaluated according to criteria that help us understand their effectiveness in implementing efficient parallel algorithms on real hardware. These criteria are:
Diameter.
Bisection width.
Number of edges per node.
Maximum edge length.
Diameter
The diameter of a network is the largest distance between two nodes.
A low diameter is better, because the diameter puts a lower bound on the complexity of parallel algorithms requiring communication between arbitrary pairs of nodes.
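The diameter of a small processor graph can be computed directly with breadth-first search. An illustrative Python sketch (the adjacency-dict representation and the 6-node ring are assumptions of this example, not from the slides):

```python
from collections import deque

def diameter(adj):
    """Largest shortest-path distance in an unweighted, connected graph.

    adj: dict mapping each node to a list of its neighbours.
    """
    def eccentricity(source):
        # BFS from `source` gives shortest distances to every node
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(eccentricity(node) for node in adj)

# A 6-processor ring: each node connects to its two neighbours.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
```

For the 6-node ring the diameter is 3: a message may have to travel halfway around the ring.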
Network bisection width
The bisection width of a network is the minimum number of edges that must be removed in order to divide the network into two halves (within one node).
A high bisection width is better, because in algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of the parallel algorithm.
Number of edges per node
It is best if the number of edges per node is a constant independent of the network size; the processor organization then scales more easily to systems with large numbers of nodes.
Maximum edge length
For scalability reasons it is best if the nodes and edges of the network can be laid out in three-dimensional space so that the maximum edge length is a constant independent of the network size.
Introduction to the complexity of parallel algorithms
Running times
Types of analysis for algorithms
1. Worst case (the usual one): T(n) = maximum time needed for a problem of size n.
2. Average (expected) case: T(n) = expected time for a problem of size n. This requires assuming a statistical distribution of the inputs.
3. Best case: T(n) = minimum time for a problem of size n. This can be misleading, because it does not always occur.
The worst-case complexity of a RAM program is the function f(n), the maximum time the program takes to execute over all inputs of size n.
The expected time complexity of a RAM program is the average time it takes to execute over the inputs of size n.
There are two ways of measuring time in the RAM model:
Uniform cost criterion: each RAM instruction requires one unit of time to execute, and each register requires one unit of memory space. It is appropriate if the values manipulated by the program fit into one machine word.
Logarithmic cost criterion: it takes into account that an actual memory word has a limited storage capacity.
Big O
The notation O(g(x)) is usually used to refer to the functions bounded above by the function g(x).
The asymptotically tight bound (Θ notation) is related to the asymptotic upper and lower bounds (O and Ω notation).
An asymptotic upper bound is a function that serves as an upper bound for another function as the argument tends to infinity.
Usual orders for functions:
The orders most used in algorithm analysis, in increasing order, are the following (where c represents a constant): O(1), O(log n), O(n), O(n log n), O(n²), O(n³), O(cⁿ).
Depending on the order of the function, it is said to be efficient or not. For a problem of size n, the order of the function that solves it can range from optimal (best case) or efficient to inefficient (worst case).
PARALLEL PROGRAM DESIGN
Single-Memory Systems
The CHALLENGE/Onyx uses a high-speed system bus to connect all components of the system.
Memory has these features:
There is a single address map; that is, the same word of memory has the same address in every CPU.
There is no time penalty for communication between processes, because every memory word is accessible in the same amount of time from any CPU.
All peripherals are equally accessible from any process.
Processes running in different CPUs can share memory and can update identical memory locations concurrently.
Processes can map a single segment of memory into the virtual address spaces of two or more concurrent processes.
Two processes can transfer data at memory speeds, one putting the data into a mapped segment and the other process taking the data out.
They can coordinate their access to the data using semaphores located in the shared segment.
MULTIPLE-MEMORY SYSTEMS
There is not a single address map: a word of memory in one node cannot be addressed at all from another node.
There is a time penalty for some interprocess communication.
Peripherals are accessible only in the node to which they are physically attached.
The Message-Passing Interface (MPI) is designed specifically for applications that execute concurrently in multiple nodes.
Models of Parallel Execution
Two features characterize models for parallel programming:
Granularity: the relative size of the unit of computation that executes in parallel (a single statement, a function, or an entire process).
Communication channel: the basic mechanism by which the independent, concurrent units of the program exchange data and synchronize their activity.
Process-Level Parallelism
A UNIX process consists of:
an address space,
a large set of process state values,
one thread of execution.
Interprocess communication (IPC) mechanisms can be used:
to exchange data,
to coordinate the activities of multiple, asynchronous processes.
In traditional UNIX practice, one process creates another with the system call fork(), which makes a duplicate of the calling process, after which the two copies execute in parallel.
Typically the new process immediately uses the exec() function to load a newprogram.
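The fork()/exec() pattern looks like this in a minimal Python sketch (POSIX-only; the program `/bin/echo` is assumed to exist, as it does on typical UNIX systems):

```python
import os

pid = os.fork()                   # duplicate the calling process
if pid == 0:
    # child copy: immediately replace itself with a new program
    os.execv("/bin/echo", ["echo", "hello from the child"])
else:
    # parent copy: continue in parallel, then wait for the child
    _, status = os.waitpid(pid, 0)
```

After fork() returns, both copies run the identical program text; only the return value (0 in the child, the child's PID in the parent) distinguishes them.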
Lightweight process
It shares some of its process state values with its parent process.
It does not have its own address space; it continues to execute in the address space of the original process.
A lightweight process differs from a thread in two significant ways:
It has a full set of UNIX state values. Some of these, for example the table of open file descriptors, can be shared with the parent process, but in general a lightweight process carries most of the state information of a process.
Dispatch of lightweight processes is done in the kernel, and has the same overhead as dispatching any process.
The library support for statement-level parallelism is based on the use of lightweight processes.
Process Creation
The process that creates another is called the parent process.
The processes it creates are child processes.
The parent and its children together are a share group.
The fork() function is the traditional UNIX way of creating a process.
The new process is a duplicate of the parent process, running in a duplicate of the parent's address space.
Both execute the identical program text.
A parent process should not terminate while its child processes continue to run.
Process Management
When the parent process has nothing to do after starting the child processes, it can:
loop on wait() until wait() reports that no more children exist,
then exit.
Sometimes it is necessary to handle child termination, but the parent cannot suspend. In this case the parent can treat the termination of a child process as an asynchronous event.
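The wait-until-no-children pattern can be sketched with POSIX calls from Python (the children's work is omitted; `os.wait()` raises `ChildProcessError` once no children remain):

```python
import os

# Parent starts three children; each child "works" and exits immediately.
for _ in range(3):
    if os.fork() == 0:
        os._exit(0)               # child terminates without returning

# Parent loops on wait() until the OS reports no more children exist.
reaped = 0
while True:
    try:
        os.wait()
        reaped += 1
    except ChildProcessError:     # wait() reports no more children
        break
```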
Parallelism in Real-Time Applications
In real-time programs such as aircraft or vehicle simulators, separate processes are used to divide the work of the simulation and distribute it onto multiple CPUs.
In these applications, IRIX facilities (see the REACT Real-Time Programmer's Guide) can be used to:
reserve one or more CPUs of a multiprocessor for exclusive use by the application,
isolate the reserved CPUs from all interrupts,
assign specific processes to execute on specific, reserved CPUs.
The Frame Scheduler:
seizes one or more CPUs of a multiprocessor,
isolates them,
executes a specified set of processes on each CPU in strict rotation.
The Frame Scheduler has:
much lower overhead than the normal IRIX scheduler,
features designed for real-time work, including detection of overrun (when a scheduled process does not complete its work in the necessary time) and underrun (when a scheduled process fails to execute in its turn).
Thread-Level Parallelism
A thread is an independent execution state within the context of a larger program; that is:
a set of machine registers,
a call stack,
the ability to execute code.
A program can create many threads to execute in the same address space.
There are two main reasons for using threads: portability and performance.
There are three key differences between a thread and a process:
A UNIX process has its own set of UNIX state information, for example its own effective user ID and set of open file descriptors. Threads exist within a process and do not have distinct copies of these state values; threads share the single state belonging to their process.
Each UNIX process has a unique address space that is accessible only to that process. Threads within a process share the single address space belonging to their process.
Processes are scheduled by the kernel. Threads are scheduled by code that operates in the user address space, without kernel assistance; thread scheduling can therefore be faster than process scheduling.
Threads:
A thread takes relatively little time to create or destroy, compared with creating a lightweight process, and it shares all resources and attributes of a single process (except for the signal mask).
If you want:
each executing entity to have its own set of file descriptors,
to make sure that one entity cannot modify data shared with another entity,
then you must use lightweight processes or normal processes. Threads cannot use these IPC mechanisms.
Threads can coordinate using these mechanisms:
Unnamed semaphores, for general coordination and resource management.
Message queues.
Mutex objects, which allow threads to gain exclusive use of a shared variable.
Condition variables, which allow a thread to wait while a controlling predicate is false.
Semaphores, locks, and barriers coordinate multiple threads within a single program.
Mutexes:
A mutex is a software object that stands for:
the right to modify some shared variable,
the right to execute a critical section of code.
A mutex can be owned by only one thread at a time; other threads trying to acquire it wait.
When a thread wants to modify a variable that it shares with other threads, or to execute a critical section, the thread claims the associated mutex. This can cause the thread to wait until it can acquire the mutex.
When the thread has finished using the shared variable or critical code, it releases the mutex.
If two or more threads claim the mutex at once, one acquires the mutex and continues; the others are blocked until the mutex is released.
A mutex has attributes that control its behavior.
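The claim/release protocol maps directly onto a threading lock. A Python sketch (Python's `threading.Lock` plays the role of the mutex; the shared counter is an invented example):

```python
import threading

counter = 0                        # the shared variable the mutex protects
mutex = threading.Lock()           # stands for the right to modify `counter`

def worker():
    global counter
    for _ in range(10_000):
        with mutex:                # claim the mutex; wait if another thread owns it
            counter += 1           # critical section
        # leaving the `with` block releases the mutex

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the mutex, every increment is exclusive, so the four threads always leave the counter at 40000; without it, lost updates become possible.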
Condition Variables
A condition variable provides a way in which a thread can:
temporarily give up ownership of a mutex,
wait for a condition to be true,
then reclaim ownership of the mutex,
all in a single operation.
Preparing Condition Variables
Condition variables are supplied with:
a mechanism of attribute objects,
static and dynamic initializers.
A condition variable must be initialized before use.
Using Condition Variables
A condition variable is a software object that represents a test of a Boolean condition.
Typically the condition changes because of a software event, such as "another thread has supplied needed data."
A thread that wants to wait for that event claims the condition variable, which causes it to wait.
The thread that recognizes the event signals the condition variable, releasing one or all threads that are waiting for the event.
A thread holds a mutex that represents a shared resource. While holding the mutex, the thread finds that the shared resource is not complete or not ready.
The thread needs to do three things:
give up the mutex so that some other thread can renew the shared resource,
wait for the event "the resource is now ready for use,"
re-acquire the mutex for the shared resource.
These three actions are combined into one using a condition variable.
When the event is signalled (or the time limit expires), the mutex is reacquired.
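The three combined actions correspond to `Condition.wait()` in Python's threading module, which releases the underlying mutex, sleeps, and re-acquires the mutex in one call (the producer/consumer roles here are an invented example):

```python
import threading

cond = threading.Condition()       # a condition variable paired with a mutex
data = []                          # the shared resource
received = []

def consumer():
    with cond:                     # hold the mutex
        while not data:            # resource not ready: test the predicate
            cond.wait()            # give up the mutex, wait, then re-acquire it
        received.append(data.pop())

def producer():
    with cond:
        data.append(42)            # renew the shared resource
        cond.notify()              # signal "resource is now ready for use"

t = threading.Thread(target=consumer)
t.start()
producer()
t.join()
```

The `while` loop around `wait()` re-tests the predicate after waking, which guards against spurious wakeups and against the producer running first.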
Statement-Level Parallelism
This is the finest level of granularity. Statement-level parallel support is based on using common variables in memory, so it can be used only within the bounds of a single-memory system.
The method of creating an optimized parallel program is as follows:
Write a complete application that runs on a single processor.
Completely debug and verify the correctness of the program in serial execution.
Apply the source analyzer.
Add assertions to the source program. These are not explicit commands to parallelize, but high-level statements that describe the program's use of data.
Run the program on a single-memory multiprocessor.
Distributed-Computing Models
The MPI model (Message-Passing Interface)
MPI is a standard programming interface for the construction of a portable, parallel application in Fortran 77 or in C, especially when the application can be decomposed into a fixed number of processes operating in a fixed topology (for example, a pipeline, grid, or tree).
The PVM model (Parallel Virtual Machine)
PVM is an integrated set of software tools and libraries that emulates a general-purpose, flexible, heterogeneous, concurrent computing framework on interconnected computers of varied architecture. Using PVM, you can create a parallel application that executes as a set of concurrent processes on a set of computers that can include uniprocessors, multiprocessors, and nodes of Array systems.
Each is a formal, abstract model for distributing a computation across the nodes of a
multiple-memory system, without having to reflect the system configuration in the source code.
Processes and threads allow you to execute in parallel within a single system memory. When the system memory is distributed among multiple independent machines, your program must be built around a message-passing model.
In a message-passing model, your application consists of multiple, independent processes, each with its own address space, running on possibly many different computers.
Each process shares data and coordinates with the others by passing messages. IRIX supports two libraries: Message-Passing Interface (MPI) and Parallel Virtual Machine (PVM).
Choosing Between MPI and PVM
The MPI interface is the primary and preferred model for distributed applications.
In many ways, MPI and PVM are similar:
Each is designed, specified, and implemented by third parties that have no direct interest in selling hardware.
Support for each is available over the Internet at low or no cost.
Each defines portable, high-level functions that are used by a group of processes to make contact and exchange data without having to be aware of the communication medium.
Each supports C and Fortran 77.
Each provides for automatic conversion between different representations of the same kind of data, so that processes can be distributed over a heterogeneous computer network.
MPI
The primary reason MPI is preferred is performance. The design of MPI is such that a highly optimized implementation can be created for a homogeneous environment. MPI applications take advantage of this to exchange data with small latencies and high data rates.
Another difference between MPI and PVM is in the support for the "topology" (the interconnect pattern: grid, torus, or tree) of the communicating processes.
In MPI, the group size and topology are fixed when the group is created. This permits low-overhead group operations.
In PVM, group composition is dynamic, which causes more overhead in common group-related operations.
Converting a PVM program into an MPI program
To a large extent, the library calls of MPI and PVM provide similar functionality.
Some PVM calls do not have a counterpart in MPI, and vice versa.
The semantics of some of the equivalent calls are inherently different for the two libraries.
The process of converting a PVM program can therefore be complicated, depending on the particular PVM calls and how they are used.
PVM includes a console, which is useful for monitoring and controlling the states of the machines in the virtual machine and the state of execution of a PVM job.
The MPI standard does not provide mechanisms for specifying the initial allocation of processes to an MPI computation and their binding to physical processors.
The differences between PVM and MPI
The chief differences between the current versions of the PVM and MPI libraries are as follows:
PVM supports dynamic creation of tasks, whereas MPI does not.
PVM supports dynamic process groups, that is, groups whose membership can change dynamically at any time during a computation. MPI does not support dynamic process groups.
The chief difference between PVM groups and MPI communicators is that any PVM task can join or leave a group independently, whereas in MPI all communicator operations are collective.
A PVM task can add or delete a host from the virtual machine, thereby dynamically changing the number of machines a program runs on. This is not available in MPI.
The differences between PVM and MPI (2)
PVM provides two methods of signaling other PVM tasks:
sending a UNIX signal to another task,
notifying a task about an event by sending it a message with a user-specified tag that the application can check.
These functions are not available in MPI.
A task can leave and rejoin a PVM session as many times as it wants, whereas an MPI task must initialize and finalize exactly once.
A PVM task can be registered by another task as responsible for adding new PVM hosts, or as a PVM resource manager, or as responsible for starting new PVM tasks. These features are not available in MPI.
A PVM task can multicast data to a set of tasks. As opposed to a broadcast, this multicast does not require the participating tasks to be members of a group. MPI does not have a routine to do multicasts.
On the other hand, MPI provides several features that are not available in PVM, including:
a variety of communication modes,
communicators,
derived data types,
additional group-management facilities,
virtual process topologies,
a larger set of collective communication calls.
Programming models supported by PVM
Pure SPMD Program
In the SPMD program model, n instances of the same program are started as the n tasks of the parallel job, using the spawn command (or by hand at each of the n hosts simultaneously).
No tasks are dynamically spawned within the tasks.
This scenario is essentially the same as the current MPI one, where no tasks are dynamically spawned.
General SPMD Model
In this model, n instances of the same program are executed as n tasks of the parallel job.
One or more tasks are started at the beginning, and these dynamically spawn the remaining tasks in turn.
MPMD Model
In the MPMD programming model, one or more distinct tasks (having different executables) are started by hand, and these tasks dynamically spawn other (possibly distinct) tasks.
Parallelism Control
Segments
Memory is manipulated through its memory locations, but its addressing can be:
linear (direct);
segmented (data, code, stack, and extra segments), with global and local descriptors (which indicate locations shared by different segments);
paged (linear address + offset + page table + page-table directory);
a combination of segmentation and paging (virtual memory and protection levels: kernel at level 0, system services at level 1, operating-system extensions at level 2, applications at level 3). The kernel has the maximum privilege and the greatest protection.
Exceptions (instructions) and interrupts (devices).
Processes
MPI is the best example of parallelism control by means of processes.
Semaphores
Software objects used to coordinate access to counted resources.
Readers and writers
Software objects that mark and interpret a signal in each process.
Critical sections
Bottlenecks.
Synchronization
Barriers are a convenient way to synchronize parallel processes on multiprocessor systems.
Interprocess Communication
"Tipos de Comunicacin interprocesos
Compartir memoria entre procesos (memoria compartida) Exclusin Mutua, incluye semforos (semaphores), candados (locks), y similares Sealizacin de eventos Colas de mensajes (Message Queues), describe dos variedades de cola de
mensajes Bloqueo de archivo y grabado (File and Record Locking)
Una comunicacin Interproceso (IPC) es Cualquier coordinacin de las acciones de mltiples procesos Enviar datos de un procesador a otro
No se deben mezclar las implementaciones de un mecanismo dado en un programa
sencillo, pueden dar resultados impredecibles.
TRENDS IN PARALLEL COMPUTING
THE GRID PROJECT
Grid.org is a single destination site for large-scale, non-profit research projects of global significance. With the participation of over 3 million devices worldwide, grid.org projects like Cancer Research, Anthrax Research, Smallpox Research, and the new Human Proteome Folding Project (running in conjunction with IBM's new World Community Grid) have achieved record levels of processing speed and success.
Grid.org projects are powered by United Devices' Grid MP technology, the leading solution for commercial enterprise grid deployments.
The basics
Grid computing is a form of distributed computing that involves coordinating and sharing computing, application, data, storage, or network resources across dynamic and geographically dispersed organizations. Grid technologies promise to change the way organizations tackle complex computational problems. However, the vision of large-scale resource sharing is not yet a reality in many areas. Grid computing is an evolving area of computing, where standards and technology are still being developed to enable this new paradigm.
Why is it important?
Time and money. Organizations that depend on access to computational power to advance their business objectives often sacrifice or scale back new projects, design ideas, or innovations due to sheer lack of computational bandwidth. Project demands simply outstrip computational power, even if an organization has significant investments in dedicated computing resources.
Even given the potential financial rewards from additional computational access, many enterprises struggle to balance the need for additional computing resources with the need to control costs. Upgrading and purchasing new hardware is a costly proposition, and with the rate of technology obsolescence, it is eventually a losing one. By better utilizing and distributing existing compute resources, Grid computing will help alleviate this problem.

Delivering grid benefits today

Many companies want to take advantage of the cost and efficiency benefits that
Many companies want to take advantage of the cost and efficiency benefits thatf id i f t t t d ith t b i l k d i t t th t
-
8/4/2019 Computo paralelo
163/173
come from a grid infrastructure today, without being locked in to a system that will not grow with their needs.
To provide customers the solution they need, United Devices tackled the complex security, scalability, and unobtrusiveness issues required for a superior enterprise grid, while building towards the open standards of the GGF. By embracing these standards, United Devices lets its customers move toward compatibility with future grid technologies and adopt upcoming technologies as they are developed, while delivering the promises and benefits of the grid today.
The Grid MP platform by United Devices works by amalgamating the underutilized IT resources on a corporate network into a powerful enterprise grid that can be shared by groups across the organization, even geographically disparate groups.
The most common corporate technology assets, desktop PCs, are also the most underutilized, often using only 10% of their total compute power even when actively engaged in their primary business functions. By harnessing these plentiful underused computing assets and leveraging them for revenue-driving projects, the Grid MP platform provides immediate value for companies that want to move forward with their grid strategies without limiting any future grid developments.
The benefits of building an enterprise grid with the Grid MP platform include:
Lower Computing Costs
On a price-to-performance basis, the Grid MP platform gets more work done with less administration and budget than dedicated hardware solutions. Depending on the size of your network, the price-to-performance ratio for computing power can improve by an order of magnitude.
Faster Project Results
The extra power generated by the Grid MP platform can directly impact an organization's ability to win in the marketplace by shortening product development cycles and accelerating research and development processes.
Better Product Results
Increased, affordable computing power means not having to ignore promising avenues or solutions because of a limited budget or schedule. The power created by the Grid MP platform can help to ensure a higher quality product by allowing higher-resolution testing and results, and can permit an organization to test more extensively prior to product release.

Linux and Beowulf
Linux is an open-source, freely available Unix kernel. It includes free compilers, debuggers, editors, text processors, WWW servers, mail servers, SQL servers, and a whole range of useful tools.
Beowulf is a project that produced parallel Linux clusters with off-the-shelf hardware and freely available software:
o Uses inexpensive Intel x86-based boxes (PCs)
o One or more networking methods (10/100 Ethernet, Myrinet, ATM, etc.)
o Fast network connectivity (hub, switch, gigabit switch, or other)
o Linux operating system (freely available Unix clone, includes full source code)
o cc, f77, vi, emacs, perl, python, and other free compilers and tools
o Free PVM or MPI implementations (MPICH, LAM), etc.
o Commercial HPF implementations are available
PC-cluster style supercomputer, but much cheaper
Tools for Parallel Programming

Introduction to Linux clusters
The special effects company Weta Digital used a cluster of 200 Linux servers to produce computer animation for the film The Lord of the Rings. Weta Digital also contributed to other projects, such as Shrek and Titanic.

The advantage of using Linux is its low cost and open-source code, which allows the system to be adapted to technical needs.
A cluster is a set of physically separated processors (not in the same cabinet) running under a single operating system, which together work on the same application.
Processor farm (FARM) and a satellite image processed in parallel.
Below are other applications in which clusters running the Linux operating system were used.
Plot 3-D simulation of space shuttle streamlining
Airflow design for performance and fuel efficiency