Presentation: InfiniBand
TRANSCRIPT
1.1 Conventional Bus Architecture
[Diagram: CPU and system memory attach to the system controller (system-to-I/O bridge) via the system bus; PCI-to-PCI bridges hang off the System-I/O bus (PCI) #1 and fan out to PCI bus #2 and PCI bus #3, which host SCSI, LAN, graphics and other I/O controllers.]
Some drawbacks of PCI:
- PCI-to-PCI bridges are needed to attach more devices
- shared bandwidth
- uncontrolled termination
- many pins for each connection
- biggest disadvantage: cannot support connections outside the box
Paradigm Shift
From supercomputers and mainframes to a bunch of interconnected Linux machines:
much lower cost, but lower reliability, underutilization, higher complexity and a storage bottleneck.
SWITCHED FABRIC
- point-to-point, switch-based interconnect
- designed for fault tolerance
- each link has exactly one device connected
- provides scalability: aggregate bandwidth increases as additional switches are added
Switched Fabric Architecture
[Diagram: endnodes connected to one another through two switches.]
Designed for high bandwidth (2.5 up to 30 Gb/s), with fault tolerance and scalability.
Pushed by industry leaders like Sun, HP, IBM, Intel, Microsoft and Dell.
A switched fabric is a point-to-point interconnection, meaning that every link has exactly one device connected.
Termination is well controlled and the same for every device.
I/O performance is greater within a fabric.
Contrasting the Different Architectures
We know that PCI is the bus standard designed to provide a low-cost interface, so it became the most common I/O connection in the PC.
Its bandwidth capabilities are not able to keep up with the requirements that servers place on it.
Today's servers need host cards such as SCSI cards (soon Ultra320 SCSI), Gigabit Ethernet cards, clustering cards and so on.
So PCI cannot keep up with the I/O bandwidth required by these devices.
Feature               Fabric      Bus
Topology              Switched    Shared bus
Pin count             Low         High
Number of end points  Many        Few
Max signal length     Kilometers  Inches
Reliability           Yes         No
Scalable              Yes         No
Fault tolerant        Yes         No
What Defines InfiniBand?
InfiniBand is a specification: a switched fabric architecture which enables:
- Increased network bandwidth
- Improved reliability and failover
- Loss-less connectivity support
- Shared resources
- Lower CPU utilization
InfiniBand Characteristics
- Industry-standard specification
- A system interconnect fabric architecture
- Used between any combination of servers, communication equipment, storage devices and embedded systems
- Low latency
- High-bandwidth interconnections
- Low processing overhead
- Carries multiple traffic types over a single connection
IBA (simple)
[Diagram: CPU, system controller and system memory connect through a Host Channel Adapter (HCA) to an IB switch; Target Channel Adapters (TCAs) attach the I/O controllers to the fabric.]
Host Channel Adapter (HCA), Target Channel Adapter (TCA)
InfiniBand: A Layered Hardware Protocol
1) Physical Layer
- Defines both electrical and mechanical characteristics for the system
- Includes cables and receptacles for fibre and copper media, and backplane connectors
- Defines three link speeds: 1X, 4X, 12X
- Each individual link is a 4-wire differential connection that provides a full-duplex connection at 2.5 Gb/s
3.4.1 Physical Link
[Diagram: 1x, 4x and 12x link widths.]

IB Link   Signal Count   Signaling Rate [Gb/s]   Data Rate [Gb/s]   Full-Duplex Data Rate [Gb/s]
1X        4              2.5                     2.0                4
4X        16             10                      8                  16
12X       48             30                      24                 48

Note: Because the data is 8b/10b encoded, the actual raw bandwidth of a 1X link is 2.0 Gb/s; since links are bi-directional, this gives 4 Gb/s. When multiple ports are used, the I/O bandwidth is additive (see the sketch below).
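The relationship between the signaling rate and the usable data rate in the table is just the 8b/10b encoding arithmetic (10 line bits carry 8 data bits). A minimal sketch of that calculation, assuming only the per-lane rate and the link widths from the table above:

# Per-lane signaling rate of the InfiniBand physical layer (Gb/s).
SIGNALING_RATE_PER_LANE = 2.5

# 8b/10b encoding: 8 data bits are carried in 10 line bits.
ENCODING_EFFICIENCY = 8 / 10

def link_rates(width):
    # Return (signaling, data, full-duplex data) rates in Gb/s for a 1X/4X/12X link.
    signaling = width * SIGNALING_RATE_PER_LANE
    data = signaling * ENCODING_EFFICIENCY   # usable bandwidth in one direction
    return signaling, data, 2 * data         # full duplex counts both directions

for width in (1, 4, 12):
    s, d, fd = link_rates(width)
    print(f"{width:>2}X: signaling {s:5.1f} Gb/s, data {d:5.1f} Gb/s, full duplex {fd:5.1f} Gb/s")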
2) Link Layer
- Encompasses packet layout, point-to-point operations and switching within a local subnet
- Packets: two types, management packets and data packets
  1. Management packets: used for link configuration and maintenance
  2. Data packets: carry up to 4 KB of transaction payload
- Switching: devices within a subnet have a 16-bit Local ID (LID) assigned by the subnet manager; this LID is used for addressing
3) Network Layer
- Handles routing of packets from one subnet to another (within a subnet, the network layer is not required)
- Packets contain a Global Route Header (GRH)
- The GRH contains the 128-bit IPv6 address
4) Transport Layer
- Responsible for in-order packet delivery, channel multiplexing and transport services
- Also handles transaction data segmentation when sending, and reassembly when receiving (see the sketch below)
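As a rough illustration of the transport layer's segmentation and reassembly, the sketch below splits a message into payloads no larger than the 4 KB limit mentioned for data packets; the 10,000-byte message size is just an example, not a value from the slides.

MAX_PAYLOAD = 4 * 1024   # data packets carry up to 4 KB of transaction payload

def segment(message, mtu=MAX_PAYLOAD):
    # Sender side: split a transaction's data into payload-sized chunks.
    return [message[i:i + mtu] for i in range(0, len(message), mtu)]

def reassemble(chunks):
    # Receiver side: put the in-order chunks back together.
    return b"".join(chunks)

message = bytes(10_000)        # hypothetical 10,000-byte transaction
chunks = segment(message)
assert len(chunks) == 3        # 4096 + 4096 + 1808 bytes
assert reassemble(chunks) == message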
IBA Data Packet Format
[Diagram: on the wire, each packet is framed by a start delimiter, the packet itself, an end delimiter and idles. The packet consists of LRH | GRH | BTH | ETH | Payload | Immediate Data | ICRC | VCRC, with the headers contributed by the link, network, transport and upper layers respectively.]
Field sizes: Local Routing Header (8 bytes), Global Routing Header (40 B), Base Transport Header (12 B), Extended Transport Header (4, 8, 16 or 28 B), Data (0-4 KB), Immediate Data (4 bytes), Invariant CRC (4 B), Variant CRC (2 B).
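To put those field sizes in perspective, here is a small sketch that adds up the per-packet overhead for a full 4 KB payload, assuming the smallest Extended Transport Header variant and including the optional GRH and immediate data:

# Header and trailer sizes in bytes, as listed above.
LRH, GRH, BTH, ETH_MIN = 8, 40, 12, 4
IMM_DATA, ICRC, VCRC = 4, 4, 2
MAX_PAYLOAD = 4 * 1024

overhead = LRH + GRH + BTH + ETH_MIN + IMM_DATA + ICRC + VCRC   # 74 bytes
total = overhead + MAX_PAYLOAD

print(f"overhead per packet: {overhead} bytes")
print(f"payload efficiency : {MAX_PAYLOAD / total:.1%}")        # about 98% for 4 KB payloads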
IBA Fabric
[Diagram: several nodes attached to an IBA network.]
At a high level, IBA is an interconnect for endnodes.
IBA Network Components
[Diagram: several IBA subnets interconnected by routers, with endnodes attached to the subnets.]
An IBA network is subdivided into subnets interconnected by routers. Endnodes may attach to a single subnet or to more than one subnet.
IBA Subnet Components
[Diagram: endnodes, switches, a router and a subnet manager forming a subnet.]
An IBA subnet is composed, as shown, of endnodes, switches, routers and a subnet manager. Each IB device may attach to a single switch, be connected to more than one switch, or be connected directly to other devices.
IBA Components
- Links and repeaters
- Channel adapters
- Switches
- Routers
- Management structure
Channel Adapter
[Diagram: a channel adapter with queue pairs (QPs), a DMA engine, transport logic, memory, an SMA, and several ports, each with its own set of virtual lanes (VLs).]
A CA has a DMA engine with special features that allow remote and local DMA operations. Each port has its own set of send and receive buffers. Buffering is channeled through virtual lanes (VLs), where each lane has its own flow control. The implemented Subnet Manager Agent (SMA) communicates with the subnet manager in the fabric.
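The queue pairs in the diagram are the software-visible interface to the CA: each QP is a send queue and a receive queue of work requests that the adapter's DMA engine consumes. A minimal, hypothetical model of that structure (the class and field names are illustrative, not the actual Verbs API):

from collections import deque
from dataclasses import dataclass, field

@dataclass
class WorkRequest:
    opcode: str      # e.g. "SEND", "RDMA_WRITE", "RECV"
    buffer: bytes    # local data (or buffer) the DMA engine operates on

@dataclass
class QueuePair:
    qp_number: int
    send_queue: deque = field(default_factory=deque)
    recv_queue: deque = field(default_factory=deque)

    def post_send(self, wr):
        # Work requests are consumed asynchronously by the CA hardware.
        self.send_queue.append(wr)

    def post_recv(self, wr):
        self.recv_queue.append(wr)

qp = QueuePair(qp_number=42)
qp.post_send(WorkRequest("SEND", b"hello fabric"))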
Switches
[Diagram: a switch's packet relay connecting several ports, each with its own virtual lanes (VLs).]
IBA switches are the fundamental routing component for intra-subnet routing.
Switches interconnect links by relaying packets between the links.
Switches have two or more ports between which packets are relayed.
The central switch elements are forwarding tables (see the sketch below).
Switches can be configured to forward either to a single location or to multiple devices.
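Intra-subnet forwarding amounts to looking up the packet's 16-bit destination LID in the table the subnet manager programmed into the switch. A minimal sketch of that lookup; the table contents are made up for illustration:

# Forwarding table: destination LID -> output port, as programmed by the subnet manager.
forwarding_table = {
    0x0001: 1,   # endnode behind port 1
    0x0002: 3,
    0x0003: 3,   # several LIDs may share an output port
}

def relay(dest_lid):
    # Return the output port for a packet addressed to dest_lid.
    try:
        return forwarding_table[dest_lid]
    except KeyError:
        raise ValueError(f"no route for LID {dest_lid:#06x}") from None

assert relay(0x0002) == 3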
Routers
[Diagram: a router's GRH-based packet relay connecting several ports, each with its own virtual lanes (VLs).]
IBA routers are the routing component for inter-subnet routing.
Each subnet is uniquely identified by a subnet ID.
The router reads the Global Route Header (GRH) and uses its IPv6-style network-layer address to forward packets.
Each router forwards the packet through the next subnet to another router until the packet reaches the target subnet.
The last router sends the packet into the subnet using the destination LID.
The subnet manager configures routers with information about the subnet.
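A toy model of the per-packet decision described above: compare the subnet portion of the GRH destination with the router's own subnet ID, then either deliver into the local subnet by destination LID or forward toward the next router. All values below are illustrative:

def route(dest_subnet, dest_lid, local_subnet, next_router_port):
    # Decide what a router does with a packet based on its GRH destination.
    if dest_subnet == local_subnet:
        # Target subnet reached: deliver using the destination LID.
        return ("local", dest_lid)
    # Otherwise forward toward the next router on the path to the target subnet.
    return ("forward", next_router_port)

assert route(dest_subnet=0xFE80, dest_lid=0x0042,
             local_subnet=0xFE80, next_router_port=2) == ("local", 0x0042)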
IBA Management
IBA management provides a subnet manager (SM).
The SM is an entity directly attached to a subnet, responsible for configuring and managing switches, routers and CAs.
An SM can be implemented in other devices, such as a CA or a switch. The SM:
- configures each CA port with a range of LIDs, GIDs and subnet IDs
- configures each switch with its LIDs, the subnet ID, and its forwarding database
- handles link failover
- maintains the service databases for the subnet and provides a GUID-to-LID/GID resolution service (see the sketch after this list)
- handles error reporting
- provides other services to ensure a solid connection
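A minimal sketch of what such a resolution service amounts to: a table maintained by the SM that maps a port's 64-bit GUID to the LID and GID it was assigned. The GUID, LID and GID values below are made up:

# Mapping maintained by the subnet manager: port GUID -> (LID, GID).
# GUIDs are 64-bit hardware identifiers; LIDs are 16-bit subnet-local addresses.
resolution_db = {
    0x0002C9000100D050: (0x0011, "fe80::0002:c900:0100:d050"),
    0x0002C9000100D051: (0x0012, "fe80::0002:c900:0100:d051"),
}

def resolve(guid):
    # Return the (LID, GID) assigned to a port GUID, if the SM knows it.
    return resolution_db.get(guid)

assert resolve(0x0002C9000100D050) == (0x0011, "fe80::0002:c900:0100:d050")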
Road to IB
2001: Venture funding; early product development; first silicon
2002: Early pilots; first-generation beta products; 1x product; 4x prototype
2003: Early adopters; commercial deployments of 1x, 4x; large vendors of IB products; early native IB server/storage; application/OS support grows
2004: Continued early adopters; first volume 1x, 4x, 12x; growing native IB for server/storage; application/OS support grows further
2005: Rapid adoption of 1x, 4x, 12x; sizeable native IB for server/storage; rapid application/OS support growth
2006: Rapid market adoption; close to 50% of servers with IB support; rapid application/OS support growth
First Vendors of IBA Components
System vendors: IBM, Intel, Dell, Sun, Microsoft, HP
IB vendors: Mellanox, Voltaire, Banderacom, Infiniswitch, VIEO, JNI
Real Deployments Today: Wall Street Bank with a 512-Node Grid
[Diagram: 512 server nodes attach through an edge fabric of 23 24-port TS-120 switches to a core fabric of 2 96-port TS-270 switches; 2 TS-360 with Ethernet and Fibre Channel gateways link the grid I/O to the existing SAN and LAN networks.]
Fibre Channel and GigE connectivity built seamlessly into the cluster.
http://www1.us.dell.com/content/products/productdetails.aspx/pedge_1850?c=us&cs=555&l=en&s=biz
NCSA: National Centre for Supercomputing Applications
[Diagram: 520 dual-CPU nodes (1,040 CPUs), in groups of 18 compute nodes, attach through an edge fabric of 29 24-port TS120 switches and 512 1 m cables to a core fabric of 6 72-port TS270 switches via 174 uplink cables.]
Parallel MPI codes for commercial clients.
Point-to-point 5.2 µs MPI latency.
Deployed: November 2004.
D.E. Shaw Bio-Informatics
[Diagram: groups of 12 compute nodes attach through 1,066 1 m cables to an edge fabric of 89 24-port TS-120 switches, which connects via 1,068 5 m/7 m/10 m/15 m uplink cables to a fault-tolerant core fabric of 12 96-port TS-270 switches.]
A 1,066-node, fully non-blocking, fault-tolerant IB cluster.
Advantages
- Superior performance
- Low latency
- High efficiency
- Fabric consolidation and low energy usage
- Reliable, stable connections
- Data integrity
- Highly interoperable environment
Drawbacks
- Complex in design
- Few platforms support it as yet
- Bleeding edge for now, so users will need to perform extensive testing
Q&A