Parallel Computing
TRANSCRIPT
-
8/4/2019 Computo paralelo
1/173
History of Parallel Computing
It begins in 1955 with Gene Amdahl in the United States, working at the IBM company.
What is parallel processing?
It is the division of work into small tasks: assigning many small tasks to multiple workers so that they work simultaneously.
Parallel processing is the use of multiple processors to execute different parts of the same program simultaneously.
Difficulties: coordinating, controlling, and monitoring the workers.
The main goals of parallel processing are:
- Solve larger problems faster.
- Reduce the execution time of computer programs.
- Increase the size of the computational problems that can be solved.
Today there are several lines of research within parallel computing, as well as a variety of machines, languages, and applications.
Advantages
The idea of parallel computing arises from the following needs:
  o perform operations much faster,
  o process large volumes of information, and
  o obtain results in the least possible time.
These needs are the "why" and, at the same time, the main goals of parallel computing.
Definition. Parallel computer:
A collection of processors interconnected in some way within a single cabinet, in order to exchange information.
Taxonomy (from the Greek taxis, "arrangement", and nomos, "norm" or "rule"): the science of classification.
The classification of computers should be based on their most salient characteristics, not on the detailed ones that appear in data sheets. Several taxonomies or classifications of computers exist, such as Skillicorn's, Shore's (6 types), Handler's, and the structural taxonomy of Hockney and Jesshope. The most important of these classifications is Flynn's.
Flynn's taxonomy: classifies computers (their architecture and operating systems) according to the ability of a system to process:
- one or more simultaneous flows of data (data streams);
- one or more simultaneous flows of instructions (instruction streams).
SISD: a single instruction stream operates on a single data stream (classic architecture, superscalar processors).
SIMD: a single instruction stream operates on multiple data streams (array processors).
MISD: multiple instruction streams operate on a single data stream (a class with no implementations; it arises only as a by-product of the classification).
MIMD: multiple instruction streams operate on multiple data streams (multiprocessors).
Paradigms of parallel computing. Paradigms of parallel software.
There are several methods of programming parallel computers. The two most common are:
- message passing and
- data parallelism.
Message passing: the user makes library calls specifically to share information between processors.
Data parallel: the partitioning of the data determines the parallelism.
Shared memory: multiple processors share a common memory space.
Remote memory operation: a set of processes in which one process can access the memory of another process without that process's participation.
Threads: a single process has multiple (and concurrent) execution paths.
Combined models: composed of two or more of the models mentioned above.
Note: these models are machine/architecture independent; any of the models can be implemented on any hardware given the support of an appropriate operating system. An effective implementation is one that comes closest to the hardware model and gives the user ease of programming.
Message Passing
The message passing model is defined as:
- a set of processes using only local memory;
- processes communicate by sending and receiving messages;
- data transfer requires cooperative operations to be performed by each process (a send operation must have a matching receive).
Programming with message passing is done by linking with and making calls to libraries which manage the data exchange between processors. Message passing libraries are available for most modern programming languages.
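The cooperative send/receive pairing described above can be modeled with Python's standard library; this is a sketch only (threads standing in for processes, queues standing in for a real message passing library such as MPI), where each "processor" touches only its own local variables and shares data exclusively through explicit sends and matching receives:

```python
import threading
import queue

channel = queue.Queue()          # one-directional message channel
result = queue.Queue()           # channel for sending the answer back

def worker():
    # Local memory only: the worker receives its input as a message.
    numbers = channel.get()      # blocking receive, matches the send below
    result.put(sum(numbers))     # send the partial result back

t = threading.Thread(target=worker)
t.start()
channel.put([1, 2, 3, 4])        # send: transfers the data to the worker
total = result.get()             # receive: blocks until the worker replies
t.join()
print(total)                     # -> 10
```

The blocking `get` is what makes the operations cooperative: a send with no matching receive would leave the data sitting in the channel.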
Data Parallel
The data parallel model is defined as:
- each process works on a different part of the same data structure;
- commonly a Single Program Multiple Data (SPMD) approach;
- data is distributed across processors;
- all message passing is done invisibly to the programmer;
- commonly built "on top of" one of the common message passing libraries.
Programming with the data parallel model is accomplished by writing a program with data parallel constructs and compiling it with a data parallel compiler.
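A minimal data-parallel sketch in Python (not a data parallel compiler; a thread pool stands in for the processors): the partition of the data determines the parallelism, every worker runs the same function on its own chunk, and no explicit communication appears in the user's code:

```python
from concurrent.futures import ThreadPoolExecutor

data = list(range(16))

# Partition: each worker gets a different part of the same data structure.
chunks = [data[i::4] for i in range(4)]

def work(chunk):
    # Every worker runs the same program on its own chunk (SPMD style).
    return sum(x * x for x in chunk)

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(work, chunks))   # data movement is implicit

total = sum(partials)
print(total)   # sum of squares of 0..15 -> 1240
```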
Classification of Parallelism.
- Temporal (pipeline).
  o A program executes sequentially and, at a certain point, the tasks are divided among several processing units. When execution finishes in each unit, sequential execution resumes.
- Spatial.
  o Spatial parallelism occurs when there are several processors and a process can run on each of them more or less independently. In the optimal case, the execution time is divided by the number of processors that are working.
- Independent.
  o This parallelism does not depend on the topology of the processor network, since the program does not adapt its structure to that of the network's connections; that is, no matter how the processors are connected, the parallel execution of the program proceeds.
Levels of parallelism:
Two qualities matter for parallel programming.
Granularity: the relative size of the unit of computation that executes in parallel. This unit can be a statement, a function, or an entire process.
Communication channel: the basic mechanism by which the independent units of the program exchange data and synchronize their activity.
Statement level.
  o The finest level of granularity.
  o Used in languages such as Power C and Power Fortran 77/90.
  o Shared variables are used within a single memory system.
Thread level.
  o A thread is an independent state of execution within the context of a larger program, that is:
    - a set of machine registers;
    - a call stack;
    - the ability to execute code.
  o A program can create several threads that execute in the same address space. The reasons for using threads are portability and performance. The resources of a single processor are shared.
Process level.
  o A process in UNIX consists of:
    - an address space;
    - a large number of process-state values;
    - a thread of execution.
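The thread level can be illustrated with Python's threading module (a sketch, not one of the languages named above): each thread is an independent execution path with its own stack, yet all of them run in the one address space of the process and see the same variables, so access to shared state must be synchronized.

```python
import threading

shared = []                       # lives in the single shared address space
lock = threading.Lock()

def task(tid):
    # Each thread: own stack and registers, same address space as siblings.
    with lock:                    # synchronize access to the shared list
        shared.append(tid)

threads = [threading.Thread(target=task, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(shared))             # -> [0, 1, 2, 3]
```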
  o The interprocess communication mechanism can be used to exchange data and to coordinate the activities of multiple asynchronous processes.
  o A process can create one or more processes; the process that creates another is called the parent process, and the one created is called the child process. The initial process is called the root.
Performance.
The performance of a parallel program is measured in three terms:
- Speedup
- Efficiency
- Cost
where:
- T = time in seconds
- T1 = time using one processing unit
- Tp = time using two or more processing units
- P = number of processing units
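The formulas themselves did not survive the slide-to-text conversion; in terms of the symbols just defined, the three metrics are standardly written as:

```latex
S_P = \frac{T_1}{T_P}
\qquad
E_P = \frac{S_P}{P} = \frac{T_1}{P \, T_P}
\qquad
C_P = P \, T_P
```

Speedup measures how much faster the parallel run is, efficiency measures how well the P units are used, and cost is the total processor-time consumed.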
Amdahl's Law.
Amdahl's law concerns the speedup obtained from using parallel processors on a problem, compared with using only one serial processor.
To understand speedup, let us first look at speed:
  o The speed of a program is the time it takes to be executed. This can be measured in any increment of time.
  o Speedup is defined as the time it takes a program to execute serially (with one processor) divided by the time it takes to execute in parallel (with several processors).
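The law's formula is not reproduced in the transcript; in its common form, if a fraction f of the program can be parallelized over p processors while the rest stays serial, the predicted speedup is S(p) = 1 / ((1 - f) + f/p). A quick sketch (the function name is ours):

```python
def amdahl_speedup(f, p):
    """Predicted speedup when a fraction f of the work is parallelizable
    and runs on p processors; the remaining (1 - f) stays serial."""
    return 1.0 / ((1.0 - f) + f / p)

# The serial fraction caps the speedup no matter how many processors:
print(amdahl_speedup(0.9, 10))       # -> ~5.26
print(amdahl_speedup(0.9, 10**9))    # approaches 1 / (1 - 0.9) = 10
```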
Parallel Architectures.
Old criteria:
- Parallel architectures tied to programming models.
- Divergent architectures, with no pattern of growth.
Parallel Architectures.
Current criteria:
- Extension of computer architecture to support communication and cooperation.
  o BEFORE: instruction sets.
  o NOW: communications.
- One must define:
  o abstractions, boundaries, primitives (interfaces);
  o structures that implement those interfaces (hw or sw).
- Compilers, libraries, and the OS are important concerns nowadays.
Parallel Architectures.
Recall that:
A parallel computer is a collection of processing elements that communicate and cooperate to solve large problems quickly.
We can say that Parallel Architecture is:
conventional architecture
+
communication architecture
Parallel Architectures.
Programming models
A programming model specifies the communications and the synchronization. Examples:
  o Multiprogramming: no communication or synchronization; program-level parallelism.
  o Shared memory: like a bulletin board.
  o Message passing: like letters or telephone calls, point to point.
  o Data parallelism: several agents act on individual data items and then exchange control information before continuing. The exchange is implemented with shared memory or with message passing.
Parallel Architectures.
Levels of abstraction in communication.
Parallel Architectures.
Evolution of the architectural models:
- Programming model, communication, and machine organization together make up the architecture.
- Shared memory space.
- Message passing.
- Data parallelism.
- Others:
  o data flow;
  o systolic arrays.
Shared Memory.
Any processor can directly reference any memory location.
  o Communication occurs implicitly through loads and stores.
Advantages:
  o Transparent placement.
  o Programming similar to time-sharing on uniprocessors, except that the processes run on different processors.
  o Good performance in load distribution.
Shared Memory.
Communication hardware:
Independent growth of memory, I/O, or processing capacity by adding modules, controllers, or processors.
Shared Memory.
Mainframe communication strategy:
- Crossbar network.
- Initially limited by the cost of the processors; later, by the cost of the network.
- Bandwidth grows with p.
- High cost of scaling; use of multistage networks.
Shared Memory.
Minicomputer communication strategy:
- Almost all microprocessor systems use a bus.
- Widely used for parallel computing.
- Called SMPs, symmetric multiprocessors.
- The bus can become a bottleneck.
- The cache coherence problem.
- Low cost of scaling.
Shared Memory.
Example: Intel Pentium Pro Quad. Coherence and multiprocessing integrated into the processor module.
Shared Memory.
Example: Sun Enterprise. 16 boards of any type: processors + memory, or I/O. Memory access is over the bus, symmetric.
Shared Memory.
Other communication options:
- Interconnection problems: cost (crossbars) or bandwidth (bus).
- Dance-hall: scalable at lower cost than crossbars.
  o Uniform memory access latency, but high.
- NUMA (non-uniform memory access):
  o construction of a single memory space with different latencies.
- COMA (Cache Only Memory Architecture):
  o memory architecture based on shared caches.
Shared Memory.
Example: Cray T3E. Scalable to 1024 processors, 480 MB/s links.
- The memory controller generates the requests for non-local locations.
- No hardware mechanism for coherence (the SGI Origin and others do provide one).
Message Passing.
- Built from complete computers, including I/O.
  o Communication through explicit I/O operations.
- Programming model: direct access only to private addresses (local memory); communication by means of messages (send/receive).
- Block diagram similar to NUMA's.
  o But communications are integrated at the I/O level.
  o Like networks of workstations (clusters), but with tighter integration.
  o Easier to build and scale than NUMA systems.
- Programming model less integrated into the hardware.
  o Libraries or operating system intervention.
Message Passing.
- send specifies the buffer to transmit and the receiving process.
- recv specifies the sending process and the buffer to store into.
- These are memory-to-memory copies, but the process names are needed.
- Optionally, the send can carry a destination tag, with matching rules applied at the destination.
- In the simplest form, pairing is achieved through the synchronization of send/recv events.
  o Multiple synchronization variants exist.
- Large overheads: copying, buffer management, protection.
Message Passing.
Evolution of message passing machines
- Early machines: a FIFO on each link.
  o Programming model very close to the hardware; simple synchronization operations.
  o Replaced by DMA, enabling non-blocking operations; buffering at the destination until recv.
- Diminishing influence of topology (hardware routing).
  o Store-and-forward routing: topology matters.
  o Introduction of multistage networks.
  o Higher cost: node-to-network communication.
  o Simpler programming.
Message Passing.
Example: IBM SP-2. Built from RS6000 workstations.
Message Passing.
Example: Intel Paragon.
The convergence of architectures.
- The evolution and role of software have blurred the boundary between shared memory and message passing.
  o send/recv supports shared memory via buffers.
  o A global address space can be built on top of message passing.
- The hardware organization is also converging.
  o Tighter integration for message passing (lower latency, higher bandwidth).
  o At a low level, some shared-memory systems implement message passing in hardware.
- Different programming models, but also in convergence.
Interconnection of parallel systems
The mission of the network in a parallel architecture is to transfer information from any source to any destination, minimizing latency, at a proportionate cost.
The network consists of:
  o nodes;
  o switches;
  o links.
The network is characterized by its:
  o topology: the structure of the physical interconnection;
  o routing: which determines the routes that messages may or must follow in the network graph;
  o switching strategy: circuit switching or packet switching;
  o flow control: mechanisms for organizing the traffic.
Interconnection of parallel systems
Classification of networks by topology.
Static:
  o static, direct point-to-point connections between the nodes;
  o tight coupling between network interface and node;
  o the vertices of the network graph are nodes or switches.
They are further classified as:
- symmetric: ring, hypercube, torus;
- non-symmetric: bus, tree, mesh.
Interconnection of parallel systems
Classification of networks by topology.
Dynamic:
  o the switches can dynamically change which nodes they interconnect.
They are further classified as:
- single-stage;
- multistage:
  o blocking (baseline, butterfly, shuffle);
  o rearrangeable (Benes);
  o non-blocking (Clos).
Interconnection of parallel systems
Characteristic parameters of a network:
- Network size: the number of nodes that compose it.
- Node degree: the number of links incident on the node.
- Network diameter: the longest of the shortest paths that can be found between any two nodes of the network.
- Symmetry: a network is symmetric if all nodes are indistinguishable from the point of view of communication.
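The diameter definition can be checked mechanically: compute the shortest path between every pair of nodes and take the longest. A sketch for a ring, one of the symmetric static topologies listed above, whose diameter should be n // 2:

```python
from collections import deque

def diameter(adj):
    """Longest shortest path over all node pairs of an unweighted graph."""
    def eccentricity(start):
        dist = {start: 0}
        q = deque([start])
        while q:                          # breadth-first search from start
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(eccentricity(v) for v in adj)

# Ring of n nodes: node i is linked to its neighbors i-1 and i+1 (mod n).
n = 8
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
print(diameter(ring))   # -> 4, i.e. n // 2
```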
Static networks
Cycle-connected 3-D hypercube
Static networks
Example of connections in a 3-D hypercube:
- connection of nodes that differ in the least significant bit;
- connection of nodes that differ in the second bit;
- connection of nodes that differ in the most significant bit.
Dynamic networks
Dynamic networks are networks whose configuration can be modified.
There are two types:
  o single-stage;
  o multistage.
Single-stage networks make connections between processing elements in a single stage.
  o It may not be possible to reach every element from every other element, so it may be necessary to recirculate the information (=> recirculating networks).
Multistage networks make connections between the processing elements in more than one stage.
Dynamic networks
Single-stage interconnection networks
Dynamic networks
Crossbar network: allows any connection.
Dynamic networks
Multistage interconnection networks
Switch boxes
The four possible configurations of a 2-input switch box.
Blocking dynamic networks
Blocking multistage networks.
  o It is not always possible to establish a new connection between a free source/destination pair, owing to conflicts with the connections in progress.
  o Generally there is a single possible path between each source/destination pair.
Blocking dynamic networks
Baseline network:
Blocking dynamic networks
Butterfly network:
Blocking dynamic networks
Perfect shuffle network:
Rearrangeable dynamic networks
Rearrangeable multistage networks.
  o It is always possible to establish a new connection between a free source/destination pair, even while other connections are in progress, but it may become necessary to change the path used by some of them (rearrangement).
  o Attractive for array processors, where all the interconnection requests are known simultaneously.
Rearrangeable dynamic networks
Benes network:
Rearrangeable dynamic networks
The Benes network can be built recursively:
Non-blocking dynamic networks
Non-blocking dynamic networks.
  o It is always possible to establish a new connection between a free source/destination pair, without restrictions.
  o They are analogous to crossbar switches, but may exhibit higher latency because of the multiple stages.
Non-blocking dynamic networks
Clos network:
Cache Coherence
Common structures of the memory hierarchy in multiprocessors:
- Shared cache.
- Memory shared over a bus.
- Interconnection through a network (dance-hall).
- Distributed memory.
Cache Coherence
Shared cache
- Small number of processors (2-8).
- Common in the mid-80s for connecting a pair of processors on a board.
- A possible strategy for chip multiprocessors.
Cache Coherence
Sharing over a bus.
- Widely used in small- and medium-scale multiprocessors (20-30 processors).
- The dominant form among current parallel machines.
- Modern microprocessors are equipped to support coherence protocols in this configuration.
Cache Coherence
Dance hall
- Easily scalable.
- Symmetric UMA structure.
- Memory too far away, especially in large systems.
Cache Coherence
Distributed memory
- Especially attractive for scalable multiprocessors.
- Non-symmetric NUMA structure.
- Fast local accesses.
Parallel computer architectures.
Types of processor organization: there are 7 important methods of organizing processors to connect them in a parallel computer.
- Mesh networks.
- Binary tree networks.
- Hypertree networks.
- Pyramid networks.
- Butterfly networks.
- Hypercube networks.
- Cube-connected cycles networks.
The organization of processors is also known as the topology, and based on it there are criteria for evaluating which model is more suitable than another. The applications can also determine whether a model is suitable or not.
Criteria for evaluating the models:
- Diameter.
- Bisection width.
- Number of edges per node.
- Maximum edge length.
- Degree of an architecture.
Mesh Networks.
- Nodes are arranged into a q-dimensional lattice.
- Communication is allowed only between neighboring nodes.
- Two-dimensional meshes:
  o mesh with no wrap-around connections;
  o mesh with wrap-around connections between processors in the same row or column;
  o mesh with wrap-around connections between processors in adjacent rows or columns.
- Wrap-around connections can connect processors in the same row or column, or in adjacent rows or columns.
Evaluation of the mesh:
- Interior nodes communicate with 2q other processors.
- The diameter of a q-dimensional mesh with k^q nodes is q(k - 1).
- The bisection width of a q-dimensional mesh with k^q nodes is k^(q-1).
- The maximum number of edges per node is 2q.
- The maximum edge length is a constant, independent of the number of nodes, for two- and three-dimensional meshes.
The two-dimensional mesh has been a popular topology for processor arrays:
  o Goodyear Aerospace's MPP
  o AMT DAP
  o MasPar's MP
  o The Intel Paragon XP/S multicomputer connects processors with a two-dimensional mesh.
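The mesh figures above can be restated as tiny formulas (helper names are ours), which makes the growth rates easy to compare against the other topologies in this section:

```python
def mesh_diameter(q, k):
    # q-dimensional mesh, k nodes per side (k**q nodes), no wrap-around.
    return q * (k - 1)

def mesh_bisection_width(q, k):
    return k ** (q - 1)

def mesh_max_edges_per_node(q):
    return 2 * q

# A 2-D 16x16 mesh, such as a 256-node processor array:
print(mesh_diameter(2, 16))            # -> 30
print(mesh_bisection_width(2, 16))     # -> 16
print(mesh_max_edges_per_node(2))      # -> 4
```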
Hypertree Networks
An approach to building a network with the low diameter of a binary tree and an improved bisection width. The easiest way to think of a hypertree network of degree k and depth d is to consider the network from two different angles:
- From the front, a hypertree network of degree k and depth d looks like a complete k-ary tree of height d.
- From the side, the same hypertree network looks like an upside-down binary tree of height d.
Joining the front and side views yields the complete network.
Hypertree evaluation
A 4-ary hypertree with depth d has:
- 4^d leaves;
- 2^d (2^(d+1) - 1) nodes;
- diameter 2d;
- bisection width 2^(d+1);
- never more than six edges per node;
- a maximum edge length that is an increasing function of the problem size.
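Collecting those figures in one helper (the function name is ours) makes the superscripts unambiguous after the formatting loss above:

```python
def hypertree_4ary(d):
    """Metrics of a 4-ary hypertree of depth d, per the list above."""
    return {
        "leaves": 4 ** d,
        "nodes": 2 ** d * (2 ** (d + 1) - 1),
        "diameter": 2 * d,
        "bisection_width": 2 ** (d + 1),
    }

print(hypertree_4ary(2))
# depth 2: 16 leaves, 28 nodes, diameter 4, bisection width 8
```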
Hypertree network of degree 4 and depth 2: (a) front view, (b) side view, (c) complete network. The Connection Machine CM-5 multicomputer is a 4-ary hypertree.
Pyramid Networks
An attempt to combine the advantages of mesh networks and tree networks.
A pyramid network of size k^2 is a complete 4-ary rooted tree of height log2 k, augmented with additional interprocessor links so that the processors in every tree level form a 2-D mesh network.
A pyramid of size k^2 has at its base a 2-D mesh network containing k^2 processors. The total number of processors in a pyramid of size k^2 is (4/3)k^2 - (1/3).
The levels of the pyramid are numbered in ascending order: the base has level number 0, and the single processor at the apex of the pyramid has level number log2 k.
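The processor count follows from summing the mesh levels, k^2 + k^2/4 + ... + 1 = (4k^2 - 1)/3, which is the same as (4/3)k^2 - (1/3). A quick check (helper name ours):

```python
def pyramid_processors(k):
    # Sum the 2-D mesh levels: k*k at the base, then (k/2)*(k/2), ..., 1.
    total, side = 0, k
    while side >= 1:
        total += side * side
        side //= 2
    return total

for k in (4, 8, 16):
    assert pyramid_processors(k) == (4 * k * k - 1) // 3

print(pyramid_processors(4))   # -> 21 processors in a pyramid of size 16
```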
Pyramid network of size 16.
Every interior processor is connected to nine other processors: one parent, four mesh neighbors, and four children.
Pyramid evaluation
The advantage of the pyramid over the 2-D mesh is that the pyramid reduces the diameter of the network: when a message must travel from one side of the mesh to the other, fewer link traversals are required if the message travels up and down the tree rather than across the mesh.
- The diameter of a pyramid of size k^2 is 2 log k.
- The addition of tree links does not give a significantly higher bisection width than a 2-D mesh: the bisection width of a pyramid of size k^2 is 2k.
- The maximum number of links per node is no greater than nine, regardless of the size of the network.
- Unlike a 2-D mesh, the length of the longest edge is an increasing function of the network size.
Butterfly Network
- Consists of (k+1)2^k nodes divided into k+1 rows, or ranks, each containing n = 2^k nodes.
- The ranks are labeled 0 through k.
- Ranks 0 and k are sometimes combined, giving each node four connections to other nodes.
Node connection
Let node(i, j) refer to the jth node on the ith rank, where 0 <= i <= k and 0 <= j < 2^k. Node(i, j) on rank i > 0 is connected to node(i-1, j) and to node(i-1, m), where m is the integer that differs from j in the ith most significant bit.
Butterfly evaluation
- As the rank numbers decrease, the widths of the wings of the butterflies increase exponentially.
- The length of the longest network edge increases as the number of network nodes increases.
- The diameter of a butterfly network with (k + 1)2^k nodes is 2k.
- The bisection width is 2^(k-1).
A butterfly network serves to route data from non-local memory to processors on the BBN TC2000 multiprocessor.
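The 2k diameter can be checked by building the graph and running breadth-first search; this is a sketch under our own indexing convention (rank i links column j to columns j and j with the ith most significant bit inverted on rank i-1), which matches the node-connection rule stated earlier:

```python
from collections import deque

def butterfly(k):
    """Adjacency of a (k+1)*2**k-node butterfly; nodes are (rank, column)."""
    adj = {(i, j): set() for i in range(k + 1) for j in range(2 ** k)}
    for i in range(1, k + 1):
        for j in range(2 ** k):
            m = j ^ (1 << (k - i))        # invert the ith most significant bit
            for up in ((i - 1, j), (i - 1, m)):
                adj[(i, j)].add(up)
                adj[up].add((i, j))
    return adj

def diameter(adj):
    def eccentricity(s):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(eccentricity(v) for v in adj)

k = 3
net = butterfly(k)
print(len(net))        # -> (k + 1) * 2**k = 32 nodes
print(diameter(net))   # -> 2 * k = 6
```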
Hypercube
A cube-connected network, also called a binary n-cube network, is a butterfly with its columns collapsed into single nodes.
- Consists of 2^k nodes forming a k-dimensional hypercube.
- The nodes are labeled 0, 1, ..., 2^k - 1.
- Two nodes are adjacent if their labels differ in exactly one bit position.
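The adjacency rule is a one-liner: XOR the two labels and count the set bits (helper name ours):

```python
def hypercube_adjacent(u, v):
    """Hypercube nodes are neighbors iff their labels differ in
    exactly one bit position."""
    return bin(u ^ v).count("1") == 1

print(hypercube_adjacent(0b0101, 0b0100))   # -> True: differ in one bit
print(hypercube_adjacent(0b0101, 0b0110))   # -> False: differ in two bits

# The k neighbors of node 0 in a four-dimensional hypercube (2**4 nodes):
print([v for v in range(16) if hypercube_adjacent(0, v)])   # -> [1, 2, 4, 8]
```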
A four-dimensional hypercube.
Hypercube evaluation
- The diameter of a hypercube with 2^k nodes is k.
- The bisection width of that size network is 2^(k-1).
- The hypercube organization has low diameter and high bisection width, at the expense of the number of edges per node and the length of the longest edge.
- The number of edges per node is k, the logarithm of the number of nodes in the network.
- The length of the longest edge in a hypercube network increases as the number of nodes in the network increases.
Cube-Connected Cycles Networks
The cube-connected cycles network is a k-dimensional hypercube whose 2^k "vertices" are actually cycles of k nodes.
For each dimension, every cycle has a node connected to a node in the neighboring cycle in that dimension.
24-node cube-connected cycles network.
Cube-connected cycles evaluation
Node(i, j) is connected to node(i, m) if and only if m is the result of inverting the ith most significant bit of the binary representation of j.
Compared to the hypercube, the cube-connected cycles processor organization has the advantage that the number of edges per node is three - a constant independent of network size.
Disadvantages:
- the diameter is twice that of a hypercube;
- the bisection width is lower.
Given a cube-connected cycles network of size k2^k:
  o its diameter is 2k;
  o its bisection width is 2^(k-1).
CHARACTERISTICS OF VARIOUS PROCESSOR ORGANIZATIONS
Data flow and implicit parallelism
Von Neumann vs. parallel
Parallel random access machine (PRAM):
- Provides a mental break from the von Neumann model and sequential algorithms.
- The PRAM (pronounced "pea ram") model of parallel computation allows parallel-algorithm designers to treat processing power as an unlimited resource.
- Unrealistically simple: it ignores the complexity of interprocessor communication.
- The designer of PRAM algorithms can focus on the parallelism inherent in a particular computation.
- Cost-optimal PRAM solutions exist, meaning that the total number of operations performed by the PRAM algorithm is of the same complexity class as an optimal sequential algorithm.
- Cost-optimal PRAM algorithms can serve as a foundation for efficient algorithms on real parallel computers.
RAM and PRAM computational models
RAM, a model of serial computation
The random access machine (RAM) is a model of a one-address computer. It consists of:
- memory;
- a read-only input tape;
- a write-only output tape;
- a program.
RAM program
- The program is not stored in memory and cannot be modified.
- The input tape contains a sequence of integers. Every time an input value is read, the input head advances one square.
- The output head advances after every write.
- Memory consists of an unbounded sequence of registers r0, r1, r2, .... Each register can hold a single integer. Register r0 is the accumulator, where computations are performed.
- The exact instructions are not important, as long as they resemble the instructions found on an actual computer: load, store, read, write, add, subtract, multiply, divide, test, jump, and halt.
RAM time complexity
- The worst-case time complexity of a RAM program is the function f(n) giving the maximum time taken by the program to execute over all inputs of size n.
- The expected time complexity of a RAM program is the average, over all inputs of size n, of the execution times.
- Analogous definitions hold for worst-case space complexity and expected space complexity.
- There are two ways of measuring time and space on the RAM model:
  o the uniform cost criterion;
  o the logarithmic cost criterion.
Cost criteria
- The uniform cost criterion says that each RAM instruction requires one time unit to execute and every register requires one unit of space.
- The logarithmic cost criterion takes into account that an actual word of memory has a limited storage capacity.
- The uniform cost criterion is appropriate if the values manipulated by the program always fit into one computer word.
The PRAM model of parallel computation
A PRAM consists of:
- a control unit;
- global memory;
- an unbounded set of processors, each with its own private memory.
Active processors execute identical instructions. Every processor has a unique index, and the value of a processor's index can be used to enable or disable the processor, or to influence which memory location it accesses.
PRAM computation
A PRAM computation begins with:
- the input stored in global memory;
- a single active processing element.
During each step, an active, enabled processor can:
- read a value from a single private or global memory location;
- perform a single RAM operation;
- write into one local or global memory location;
- activate another processor.
The processors are synchronized: all active, enabled processors must execute the same instruction, on different memory locations.
The computation terminates when the last processor halts.
PRAM models differ in how they handle read or write conflicts, i.e., when two or more processors attempt to read from, or write to, the same global memory location.
1. EREW (Exclusive Read, Exclusive Write): read and write conflicts are not allowed.
2. CREW (Concurrent Read, Exclusive Write): concurrent reading is allowed, i.e., multiple processors may read from the same global memory location during the same instruction step. Write conflicts are not allowed. (This is the default PRAM model.)
3. CRCW (Concurrent Read, Concurrent Write): concurrent reading and concurrent writing are allowed. A variety of CRCW models exist, with different policies for handling concurrent writes to the same global address.
Types of CRCW PRAM
Three different models: Common, Arbitrary, and Priority.
1. Common. All processors concurrently writing into the same global address must be writing the same value.
2. Arbitrary. If multiple processors concurrently write to the same global address, one of the competing processors is arbitrarily chosen as the "winner," and its value is written into the register.
3. Priority. If multiple processors concurrently write to the same global address, the processor with the lowest index succeeds in writing its value into the memory location.
Strengths of PRAM models
The EREW PRAM model is the weakest. Clearly, a CREW PRAM can execute any EREW PRAM algorithm in the same amount of time; the concurrent read facility is simply not used.
A CRCW PRAM can execute any CREW PRAM algorithm in the same amount of time.
The PRIORITY PRAM model is the strongest.
Any algorithm designed for the COMMON PRAM model will execute with the same complexity on the ARBITRARY PRAM and PRIORITY PRAM models.
If all processors writing to the same location write the same value, choosing an arbitrary processor would cause the same result.
If an algorithm executes correctly when an arbitrary processor is chosen as the "winner," the processor with the lowest index is as reasonable an alternative as any other.
Any algorithm designed for the ARBITRARY PRAM model will execute with the same time complexity on the PRIORITY PRAM model.
Because the PRIORITY PRAM model is stronger than the EREW PRAM model, an algorithm to solve a problem on the EREW PRAM can have higher time complexity than an algorithm solving the same problem on the PRIORITY PRAM model.
Increase in parallel time complexity
The increase in parallel time complexity can occur when moving from the PRIORITY PRAM model to the EREW PRAM model.
Lemma. A p-processor EREW PRAM can sort a p-element array stored in global memory in Θ(log p) time.
Theorem. A p-processor PRIORITY PRAM can be simulated by a p-processor EREW PRAM with the time complexity increased by a factor of Θ(log p).
Simulation PRIORITY PRAM by EREW PRAM
Assume the PRIORITY PRAM algorithm uses processors P1, P2, ..., Pp and global memory locations M1, M2, ..., Mm.
The EREW PRAM uses auxiliary global memory locations T1, T2, ..., Tp and S1, S2, ..., Sp to simulate each read or write step of the PRIORITY PRAM.
When processor Pi in the PRIORITY PRAM algorithm accesses memory location Mj, processor Pi in the EREW PRAM algorithm writes the ordered pair (j, i) into memory location Ti.
Then the EREW PRAM sorts the elements of T. This step takes Θ(log p) time (Lemma). By reading adjacent entries in the sorted array, the highest-priority processor accessing any particular location can be found in constant time.
Processor P1 reads memory location T1, retrieves the ordered pair (j1, i1), and writes a 1 into global memory location Si1.
The remaining processors Pk, where 2 ≤ k ≤ p, read both Tk-1 and Tk; processor Pk writes a 1 into Sik if the address in Tk differs from the address in Tk-1, and a 0 otherwise.
A concurrent write operation
A concurrent write operation, which takes constant time on a p-processor PRIORITY PRAM, can be simulated in Θ(log p) time on a p-processor EREW PRAM.
(a) Concurrent write on the PRIORITY PRAM model.
Processors P1, P2, and P4 attempt to write values to memory location M3. Processor P1 wins.
Processors P3 and P5 attempt to write values to memory location M7. Processor P3 wins.
Concurrent write on the EREW PRAM model
Each processor writes an (address, processor number) pair to a unique element of T.
The processors sort T in Θ(log p) time.
In constant time, processors can set to 1 those elements of S corresponding to the winning processors.
Winning processors write their values.
For a write instruction, the highest-priority processor accessing each memory location writes its value.
For a read instruction, the highest-priority processor accessing each memory location reads that location's value, then duplicates the value in Θ(log p) time so that there is a copy in a unique memory location for every processor to access.
PRAM ALGORITHMS
If a PRAM algorithm has lower time complexity than an optimal RAM algorithm, it is because parallelism has been used.
PRAM algorithms begin with only a single active processor and have two phases:
In the first phase, a sufficient number of processors are activated;
these activated processors then perform the computation in parallel.
Given a single active processor, it is easy to see that ⌈log p⌉ activation steps are both necessary and sufficient for p processors to become active.
Processor activation
Exactly ⌈log p⌉ processor activation steps are necessary and sufficient to go from 1 active processor to p active processors.
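The doubling argument can be checked with a few lines of Python (an illustrative sketch; `activation_steps` is a name invented here):

```python
from math import ceil, log2

def activation_steps(p):
    """Count activation steps when every active processor may
    activate one more processor per step, so the count doubles."""
    active, steps = 1, 0
    while active < p:
        active = min(2 * active, p)  # each active processor spawns one more
        steps += 1
    return steps
```

The loop count always equals ⌈log2 p⌉: for example, 3 steps suffice for p = 8 (1 → 2 → 4 → 8) and also for p = 5 (1 → 2 → 4 → 5).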
The Binary Tree Model
The binary tree is one of the most important paradigms of parallel computing.
In some algorithms, data flows top-down from the root of the tree to the leaves. Broadcast and divide-and-conquer algorithms both fit this model.
In broadcast algorithms, the root sends the same data to every leaf.
In divide-and-conquer algorithms, the tree represents the recursive subdivision of problems into subproblems.
In other algorithms, data flows bottom-up from the leaves of the tree to the root. These are called fan-in or reduction operations.
Parallel Reduction
Given:
a set of n values a1, a2, ..., an,
an associative binary operator ⊕,
reduction is the process of computing a1 ⊕ a2 ⊕ ... ⊕ an.
Parallel summation is an example of a reduction operation.
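The tree-shaped computation can be sketched in Python. This sequential sketch only mirrors the data flow; on a PRAM each round's pairwise combinations would run simultaneously (`parallel_reduce` is a name invented here):

```python
def parallel_reduce(values, op):
    """Reduce `values` with the associative operator `op` in tree order.

    Each round combines adjacent pairs, halving the number of partial
    results, so ceil(log2 n) rounds suffice.
    """
    a = list(values)
    rounds = 0
    while len(a) > 1:
        # on a PRAM, all pairs in this round are combined at the same time
        a = [op(a[k], a[k + 1]) if k + 1 < len(a) else a[k]
             for k in range(0, len(a), 2)]
        rounds += 1
    return a[0], rounds
```

Summing the values 1 through 8 gives 36 after 3 rounds (8 → 4 → 2 → 1 partial results).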
We represent each tree node with an element in an array. The mapping from the tree to the array is straightforward.
Sum of n values
Complexity:
The spawn routine requires ⌈log(n/2)⌉ doubling steps.
The sequential for loop executes ⌈log n⌉ times, and each iteration has constant time complexity.
Hence the overall time complexity of the algorithm is Θ(log n), given n/2 processors.
Processor Organizations
A processor organization can be represented by a graph:
nodes (vertices) represent processors,
edges represent communication paths between pairs of processors.
Processor organizations are evaluated according to criteria that help us understand their effectiveness in implementing efficient parallel algorithms on real hardware. These criteria are:
Diameter.
Bisection width.
Number of edges per node.
Maximum edge length.
Diameter
The diameter of a network is the largest distance between two nodes.
A low diameter is better, because the diameter puts a lower bound on the complexity of parallel algorithms requiring communication between arbitrary pairs of nodes.
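The diameter of a small processor graph can be computed directly with breadth-first search. An illustrative Python sketch (the adjacency-dict representation and the 6-node ring are assumptions of this example, not from the slides):

```python
from collections import deque

def diameter(adj):
    """Largest shortest-path distance in an unweighted, connected graph.

    adj: dict mapping each node to a list of its neighbours.
    """
    def eccentricity(source):
        # BFS from `source` gives shortest distances to every node
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return max(dist.values())
    return max(eccentricity(node) for node in adj)

# A 6-processor ring: each node connects to its two neighbours.
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
```

For the 6-node ring the diameter is 3: a message may have to travel halfway around the ring.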
Network bisection width
The bisection width of a network is the minimum number of edges that must be removed in order to divide the network into two halves (within one node).
A high bisection width is better, because in algorithms requiring large amounts of data movement, the size of the data set divided by the bisection width puts a lower bound on the complexity of the parallel algorithm.
Number of edges per node
It is best if the number of edges per node is a constant independent of the network size; the processor organization then scales more easily to systems with large numbers of nodes.
Maximum edge length
For scalability reasons it is best if the nodes and edges of the network can be laid out in three-dimensional space so that the maximum edge length is a constant independent of the network size.
Introduction to the complexity of parallel algorithms
Running times
Types of analysis for algorithms
1. Worst case (the usual one): T(n) = maximum time needed for a problem of size n.
2. Average (expected) case: T(n) = expected time for a problem of size n. This requires assuming a statistical distribution of the inputs.
3. Best case: T(n) = minimum time for a problem of size n. This can be misleading, because it does not always occur.
The worst-case complexity of a RAM program is the function f(n), the maximum time the program takes to execute over all inputs of size n.
The expected time complexity of a RAM program is the average time it takes to execute over the inputs of size n.
There are two ways of measuring time in the RAM model:
Uniform cost criterion: each RAM instruction requires one unit of time to execute, and each register requires one unit of memory space. It is appropriate if the values manipulated by the program fit into one machine word.
Logarithmic cost criterion: it takes into account that an actual memory word has a limited storage capacity.
Big O
The notation O(g(x)) is usually used to refer to the functions bounded above by the function g(x).
The asymptotically tight bound (Θ notation) is related to the asymptotic upper and lower bounds (O and Ω notation).
An asymptotic upper bound is a function that serves as an upper bound for another function as the argument tends to infinity.
Usual orders for functions:
The orders most used in algorithm analysis, in increasing order, are the following (where c represents a constant): O(1), O(log n), O(n), O(n log n), O(n²), O(n³), O(cⁿ).
Depending on the order of the function, it is said to be efficient or not. For a problem of size n, the order of the function that solves it can range from optimal (best case) or efficient to inefficient (worst case).
PARALLEL PROGRAM DESIGN
Single-Memory Systems
The CHALLENGE/Onyx uses a high-speed system bus to connect all components of the system.
Memory has these features:
There is a single address map; that is, the same word of memory has the same address in every CPU.
There is no time penalty for communication between processes, because every memory word is accessible in the same amount of time from any CPU.
All peripherals are equally accessible from any process.
Processes running in different CPUs can share memory and can update identical memory locations concurrently.
Processes can map a single segment of memory into the virtual address spaces of two or more concurrent processes.
Two processes can transfer data at memory speeds, one putting the data into a mapped segment and the other process taking the data out.
They can coordinate their access to the data using semaphores located in the shared segment.
MULTIPLE-MEMORY SYSTEMS
There is not a single address map: a word of memory in one node cannot be addressed at all from another node.
There is a time penalty for some interprocess communication.
Peripherals are accessible only in the node to which they are physically attached.
The Message-Passing Interface (MPI) is designed specifically for applications that execute concurrently in multiple nodes.
Models of Parallel Execution
Two features characterize models for parallel programming:
Granularity: the relative size of the unit of computation that executes in parallel (a single statement, a function, or an entire process).
Communication channel: the basic mechanism by which the independent, concurrent units of the program exchange data and synchronize their activity.
Process-Level Parallelism
A UNIX process consists of:
an address space,
a large set of process state values,
one thread of execution.
Interprocess communication (IPC) mechanisms can be used:
to exchange data,
to coordinate the activities of multiple, asynchronous processes.
In traditional UNIX practice, one process creates another with the system call fork(), which makes a duplicate of the calling process, after which the two copies execute in parallel.
Typically the new process immediately uses the exec() function to load a newprogram.
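The fork()/exec() pattern looks like this in a minimal Python sketch (POSIX-only; the program `/bin/echo` is assumed to exist, as it does on typical UNIX systems):

```python
import os

pid = os.fork()                   # duplicate the calling process
if pid == 0:
    # child copy: immediately replace itself with a new program
    os.execv("/bin/echo", ["echo", "hello from the child"])
else:
    # parent copy: continue in parallel, then wait for the child
    _, status = os.waitpid(pid, 0)
```

After fork() returns, both copies run the identical program text; only the return value (0 in the child, the child's PID in the parent) distinguishes them.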
Lightweight process
It shares some of its process state values with its parent process.
It does not have its own address space; it continues to execute in the address space of the original process.
A lightweight process differs from a thread in two significant ways:
It has a full set of UNIX state values. Some of these, for example the table of open file descriptors, can be shared with the parent process, but in general a lightweight process carries most of the state information of a process.
Dispatch of lightweight processes is done in the kernel, and has the same overhead as dispatching any process.
The library support for statement-level parallelism is based on the use of lightweight processes.
Process Creation
The process that creates another is called the parent process.
The processes it creates are child processes.
The parent and its children together are a share group.
The fork() function is the traditional UNIX way of creating a process.
The new process is a duplicate of the parent process, running in a duplicate of the parent's address space.
Both execute the identical program text.
A parent process should not terminate while its child processes continue to run.
Process Management
When the parent process has nothing to do after starting the child processes, it can:
loop on wait() until wait() reports that no more children exist,
then exit.
Sometimes it is necessary to handle child termination, but the parent cannot suspend. In this case the parent can treat the termination of a child process as an asynchronous event.
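The wait-until-no-children pattern can be sketched with POSIX calls from Python (the children's work is omitted; `os.wait()` raises `ChildProcessError` once no children remain):

```python
import os

# Parent starts three children; each child "works" and exits immediately.
for _ in range(3):
    if os.fork() == 0:
        os._exit(0)               # child terminates without returning

# Parent loops on wait() until the OS reports no more children exist.
reaped = 0
while True:
    try:
        os.wait()
        reaped += 1
    except ChildProcessError:     # wait() reports no more children
        break
```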
Parallelism in Real-Time Applications
In real-time programs such as aircraft or vehicle simulators, separate processes are used to divide the work of the simulation and distribute it onto multiple CPUs.
In these applications, IRIX facilities (see the REACT Real-Time Programmer's Guide) can be used to:
reserve one or more CPUs of a multiprocessor for exclusive use by the application,
isolate the reserved CPUs from all interrupts,
assign specific processes to execute on specific, reserved CPUs.
The Frame Scheduler:
seizes one or more CPUs of a multiprocessor,
isolates them,
executes a specified set of processes on each CPU in strict rotation.
The Frame Scheduler has:
much lower overhead than the normal IRIX scheduler,
features designed for real-time work, including detection of overrun (when a scheduled process does not complete its work in the necessary time) and underrun (when a scheduled process fails to execute in its turn).
Thread-Level Parallelism
A thread is an independent execution state within the context of a larger program; that is:
a set of machine registers,
a call stack,
the ability to execute code.
A program can create many threads to execute in the same address space.
There are two main reasons for using threads: portability and performance.
There are three key differences between a thread and a process:
A UNIX process has its own set of UNIX state information, for example its own effective user ID and set of open file descriptors. Threads exist within a process and do not have distinct copies of these state values; threads share the single state belonging to their process.
Each UNIX process has a unique address space that is accessible only to that process. Threads within a process share the single address space belonging to their process.
Processes are scheduled by the kernel. Threads are scheduled by code that operates in the user address space, without kernel assistance; thread scheduling can therefore be faster than process scheduling.
Threads:
A thread takes relatively little time to create or destroy, compared with creating a lightweight process, and it shares all resources and attributes of a single process (except for the signal mask).
If you want:
each executing entity to have its own set of file descriptors,
to make sure that one entity cannot modify data shared with another entity,
then you must use lightweight processes or normal processes. Threads cannot use these IPC mechanisms.
Threads can coordinate using these mechanisms:
Unnamed semaphores, for general coordination and resource management.
Message queues.
Mutex objects, which allow threads to gain exclusive use of a shared variable.
Condition variables, which allow a thread to wait while a controlling predicate is false.
Semaphores, locks, and barriers coordinate multiple threads within a single program.
Mutexes:
A mutex is a software object that stands for:
the right to modify some shared variable,
the right to execute a critical section of code.
A mutex can be owned by only one thread at a time; other threads trying to acquire it wait.
When a thread wants to modify a variable that it shares with other threads, or to execute a critical section, the thread claims the associated mutex. This can cause the thread to wait until it can acquire the mutex.
When the thread has finished using the shared variable or critical code, it releases the mutex.
If two or more threads claim the mutex at once, one acquires the mutex and continues; the others are blocked until the mutex is released.
A mutex has attributes that control its behavior.
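The claim/release protocol maps directly onto a threading lock. A Python sketch (Python's `threading.Lock` plays the role of the mutex; the shared counter is an invented example):

```python
import threading

counter = 0                        # the shared variable the mutex protects
mutex = threading.Lock()           # stands for the right to modify `counter`

def worker():
    global counter
    for _ in range(10_000):
        with mutex:                # claim the mutex; wait if another thread owns it
            counter += 1           # critical section
        # leaving the `with` block releases the mutex

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the mutex, every increment is exclusive, so the four threads always leave the counter at 40000; without it, lost updates become possible.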
Condition Variables
A condition variable provides a way in which a thread can:
temporarily give up ownership of a mutex,
wait for a condition to be true,
then reclaim ownership of the mutex,
all in a single operation.
Preparing Condition Variables
Condition variables are supplied with:
a mechanism of attribute objects,
static and dynamic initializers.
A condition variable must be initialized before use.
Using Condition Variables
A condition variable is a software object that represents a test of a Boolean condition.
Typically the condition changes because of a software event, such as "another thread has supplied needed data."
A thread that wants to wait for that event claims the condition variable, which causes it to wait.
The thread that recognizes the event signals the condition variable, releasing one or all threads that are waiting for the event.
A thread holds a mutex that represents a shared resource. While holding the mutex, the thread finds that the shared resource is not complete or not ready.
The thread needs to do three things:
give up the mutex so that some other thread can renew the shared resource,
wait for the event "the resource is now ready for use,"
re-acquire the mutex for the shared resource.
These three actions are combined into one using a condition variable.
When the event is signalled (or the time limit expires), the mutex is reacquired.
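The three combined actions correspond to `Condition.wait()` in Python's threading module, which releases the underlying mutex, sleeps, and re-acquires the mutex in one call (the producer/consumer roles here are an invented example):

```python
import threading

cond = threading.Condition()       # a condition variable paired with a mutex
data = []                          # the shared resource
received = []

def consumer():
    with cond:                     # hold the mutex
        while not data:            # resource not ready: test the predicate
            cond.wait()            # give up the mutex, wait, then re-acquire it
        received.append(data.pop())

def producer():
    with cond:
        data.append(42)            # renew the shared resource
        cond.notify()              # signal "resource is now ready for use"

t = threading.Thread(target=consumer)
t.start()
producer()
t.join()
```

The `while` loop around `wait()` re-tests the predicate after waking, which guards against spurious wakeups and against the producer running first.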
Statement-Level Parallelism
This is the finest level of granularity. Statement-level parallel support is based on using common variables in memory, so it can be used only within the bounds of a single-memory system.
The method of creating an optimized parallel program is as follows:
Write a complete application that runs on a single processor.
Completely debug and verify the correctness of the program in serial execution.
Apply the source analyzer.
Add assertions to the source program. These are not explicit commands to parallelize, but high-level statements that describe the program's use of data.
Run the program on a single-memory multiprocessor.
Distributed-Computing Models
The MPI model (Message-Passing Interface)
MPI is a standard programming interface for the construction of a portable, parallel application in Fortran 77 or in C, especially when the application can be decomposed into a fixed number of processes operating in a fixed topology (for example, a pipeline, grid, or tree).
The PVM model (Parallel Virtual Machine)
PVM is an integrated set of software tools and libraries that emulates a general-purpose, flexible, heterogeneous, concurrent computing framework on interconnected computers of varied architecture. Using PVM, you can create a parallel application that executes as a set of concurrent processes on a set of computers that can include uniprocessors, multiprocessors, and nodes of Array systems.
Each is a formal, abstract model for distributing a computation across the nodes of a
multiple-memory system, without having to reflect the system configuration in the source code.
Processes and threads allow you to execute in parallel within a single system memory. When the system memory is distributed among multiple independent machines, your program must be built around a message-passing model.
In a message-passing model, your application consists of multiple, independent processes, each with its own address space, running on possibly many different computers.
Each process shares data and coordinates with the others by passing messages. IRIX supports two libraries: Message-Passing Interface (MPI) and Parallel Virtual Machine (PVM).
Choosing Between MPI and PVM
The MPI interface is the primary and preferred model for distributed applications.
In many ways, MPI and PVM are similar:
Each is designed, specified, and implemented by third parties that have no direct interest in selling hardware.
Support for each is available over the Internet at low or no cost.
Each defines portable, high-level functions that are used by a group of processes to make contact and exchange data without having to be aware of the communication medium.
Each supports C and Fortran 77.
Each provides for automatic conversion between different representations of the same kind of data, so that processes can be distributed over a heterogeneous computer network.
MPI
The primary reason MPI is preferred is performance. The design of MPI is such that a highly optimized implementation can be created for a homogeneous environment. MPI applications take advantage of this to exchange data with small latencies and high data rates.
Another difference between MPI and PVM is in the support for the "topology" (the interconnect pattern: grid, torus, or tree) of the communicating processes.
In MPI, the group size and topology are fixed when the group is created. This permits low-overhead group operations.
In PVM, group composition is dynamic, which causes more overhead in common group-related operations.
Converting a PVM program into an MPI program
To a large extent, the library calls of MPI and PVM provide similar functionality.
Some PVM calls do not have a counterpart in MPI, and vice versa.
The semantics of some of the equivalent calls are inherently different for the two libraries.
The process of converting a PVM program can therefore be complicated, depending on the particular PVM calls and how they are used.
PVM includes a console, which is useful for monitoring and controlling the states of the machines in the virtual machine and the state of execution of a PVM job.
The MPI standard does not provide mechanisms for specifying the initial allocation of processes to an MPI computation and their binding to physical processors.
The differences between PVM and MPI
The chief differences between the current versions of the PVM and MPI libraries are as follows:
PVM supports dynamic creation of tasks, whereas MPI does not.
PVM supports dynamic process groups, that is, groups whose membership can change dynamically at any time during a computation. MPI does not support dynamic process groups.
The chief difference between PVM groups and MPI communicators is that any PVM task can join or leave a group independently, whereas in MPI all communicator operations are collective.
A PVM task can add or delete a host from the virtual machine, thereby dynamically changing the number of machines a program runs on. This is not available in MPI.
The differences between PVM and MPI (2)
PVM provides two methods of signaling other PVM tasks:
sending a UNIX signal to another task,
notifying a task about an event by sending it a message with a user-specified tag that the application can check.
These functions are not available in MPI.
A task can leave and rejoin a PVM session as many times as it wants, whereas an MPI task must initialize and finalize exactly once.
A PVM task can be registered by another task as responsible for adding new PVM hosts, or as a PVM resource manager, or as responsible for starting new PVM tasks. These features are not available in MPI.
A PVM task can multicast data to a set of tasks. As opposed to a broadcast, this multicast does not require the participating tasks to be members of a group. MPI does not have a routine to do multicasts.
On the other hand, MPI provides several features that are not available in PVM, including:
a variety of communication modes,
communicators,
derived data types,
additional group-management facilities,
virtual process topologies,
a larger set of collective communication calls.
Programming models supported by PVM
Pure SPMD Program
In the SPMD program model, n instances of the same program are started as the n tasks of the parallel job, using the spawn command (or by hand at each of the n hosts simultaneously).
No tasks are dynamically spawned within the tasks.
This scenario is essentially the same as the current MPI one, where no tasks are dynamically spawned.
General SPMD Model
In this model, n instances of the same program are executed as n tasks of the parallel job.
One or more tasks are started at the beginning, and these dynamically spawn the remaining tasks in turn.
MPMD Model
In the MPMD programming model, one or more distinct tasks (having different executables) are started by hand, and these tasks dynamically spawn other (possibly distinct) tasks.
Parallelism Control
Segments
Memory is manipulated through its memory locations, but its addressing can be:
linear (direct);
segmented (data, code, stack, and extra segments), with global and local descriptors (which indicate locations shared by different segments);
paged (linear address + offset + page table + page-table directory);
a combination of segmentation and paging (virtual memory and protection levels: kernel at level 0, system services at level 1, operating-system extensions at level 2, applications at level 3). The kernel has the maximum privilege and the greatest protection.
Exceptions (instructions) and interrupts (devices).
Processes
MPI is the best example of parallelism control by means of processes.
Semaphores
Software objects used to coordinate access to counted resources.
Readers and writers
Software objects that mark and interpret a signal in each process.
Critical sections
Bottlenecks.
Synchronization
Barriers are a convenient way to synchronize parallel processes on multiprocessor systems.
Interprocess Communication
"Tipos de Comunicacin interprocesos
Compartir memoria entre procesos (memoria compartida) Exclusin Mutua, incluye semforos (semaphores), candados (locks), y similares Sealizacin de eventos Colas de mensajes (Message Queues), describe dos variedades de cola de
mensajes Bloqueo de archivo y grabado (File and Record Locking)
Una comunicacin Interproceso (IPC) es Cualquier coordinacin de las acciones de mltiples procesos Enviar datos de un procesador a otro
No se deben mezclar las implementaciones de un mecanismo dado en un programa
sencillo, pueden dar resultados impredecibles.
TRENDS IN PARALLEL COMPUTING
THE GRID PROJECT
Grid.org is a single destination site for large-scale, non-profit research projects of global significance. With the participation of over 3 million devices worldwide, grid.org projects like Cancer Research, Anthrax Research, Smallpox Research, and the new Human Proteome Folding Project (running in conjunction with IBM's new World Community Grid) have achieved record levels of processing speed and success.
Grid.org projects are powered by United Devices' Grid MP technology, the leading solution for commercial enterprise grid deployments.
The basics
Grid computing is a form of distributed computing that involves coordinating and sharing computing, application, data, storage, or network resources across dynamic and geographically dispersed organizations. Grid technologies promise to change the way organizations tackle complex computational problems. However, the vision of large-scale resource sharing is not yet a reality in many areas. Grid computing is an evolving area of computing, where standards and technology are still being developed to enable this new paradigm.
Why is it important?
Time and money. Organizations that depend on access to computational power to advance their business objectives often sacrifice or scale back new projects, design ideas, or innovations due to sheer lack of computational bandwidth. Project demands simply outstrip computational power, even if an organization has significant investments in dedicated computing resources.
Even given the potential financial rewards from additional computational access, many enterprises struggle to balance the need for additional computing resources with the need to control costs. Upgrading and purchasing new hardware is a costly proposition, and with the rate of technology obsolescence, it is eventually a losing one. By better utilizing and distributing existing compute resources, Grid computing will help alleviate this problem.

Delivering grid benefits today

Many companies want to take advantage of the cost and efficiency benefits that
Many companies want to take advantage of the cost and efficiency benefits thatf id i f t t t d ith t b i l k d i t t th t
-
8/4/2019 Computo paralelo
163/173
come from a grid infrastructure today, without being locked in to a system that will not grow with their needs.
To provide customers the solution they need, United Devices tackled the complex security, scalability, and unobtrusiveness issues required for a superior enterprise grid, while building towards the open standards of the GGF. By embracing these standards, United Devices lets its customers move toward compatibility with future grid technologies and adopt upcoming technologies as they are developed, while delivering the promises and benefits of the grid today.
The Grid MP platform by United Devices works by amalgamating the underutilized IT resources on a corporate network into a powerful enterprise grid that can be shared by groups across the organization, even geographically disparate groups.
The most common corporate technology assets, desktop PCs, are also the most underutilized, often using only 10% of their total compute power even when actively engaged in their primary business functions. By harnessing these plentiful underused computing assets and leveraging them for revenue-driving projects, the Grid MP platform provides immediate value for companies that want to move forward with their grid strategies without limiting any future grid developments.
The benefits of building an enterprise grid with the Grid MP platform include:
Lower Computing Costs
On a price-to-performance basis, the Grid MP platform gets more work done with less administration and budget than dedicated hardware solutions. Depending on the size of your network, the price-to-performance ratio for computing power can improve by an order of magnitude.
Faster Project Results
The extra power generated by the Grid MP platform can directly impact an organization's ability to win in the marketplace by shortening product development cycles and accelerating research and development processes.
Better Product Results
Increased, affordable computing power means not having to ignore promising avenues or solutions because of a limited budget or schedule. The power created by the Grid MP platform can help to ensure a higher quality product by allowing higher-resolution testing and results, and can permit an organization to test more extensively prior to product release.

Linux and Beowulf
Linux is an open-source, freely available Unix kernel. It includes free compilers, debuggers, editors, text processors, WWW servers, mail servers, SQL servers, and a whole range of useful tools.
Beowulf is a project that produced parallel Linux clusters with off-the-shelf hardware and freely available software:
o Uses inexpensive Intel x86-based boxes (PCs)
o One or more networking methods (10/100 Ethernet, Myrinet, ATM, etc.)
o Fast network connectivity (hub, switch, gigabit switch, or other)
o Linux operating system (freely available Unix clone, includes full source code)
o cc, f77, vi, emacs, perl, python, and other free compilers and tools
o Free PVM or MPI implementations (MPICH, LAM), etc.
o Commercial HPF implementations are available
PC-cluster style supercomputer, but much cheaper
Tools for Parallel Programming

Introduction to Linux clusters
The special effects company Weta Digital used a cluster of 200 Linux servers to produce computer animation for the film The Lord of the Rings. Weta Digital also contributed to other projects, such as Shrek and Titanic.

The advantage of using Linux is its low cost and open-source code, which allows the system to be adapted to technical needs.
A cluster is a set of physically separated processors (not in the same cabinet) running under a single operating system, which together work on the same application.
Processor farm (FARM) and a satellite image processed in parallel.
Below are other applications in which clusters running the Linux operating system were used.
Plot 3-D simulation of space shuttle streamlining
Airflow design for performance and fuel efficiency