Designing Cloud and Grid Computing Systems with InfiniBand and High-Speed Ethernet
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: panda@cse.ohio-state.edu
Hadoop Architecture
• Underlying Hadoop Distributed File System (HDFS)
• Fault tolerance by replicating data blocks
• NameNode: stores information on data blocks and their locations
OpenFabrics Stack with Unified Verbs Interface
[Figure: Verbs Interface (libibverbs) layered over vendor-specific drivers: Mellanox (libmthca), QLogic (libipathverbs), IBM (libehca), Chelsio (lib…)]
• For IBoE and RoCE, the upper-level stacks remain completely unchanged
• Within the hardware:
– Transport and network layers remain completely unchanged
OpenFabrics Software Stack
SA: Subnet Administrator; MAD: Management Datagram; SMA: Subnet Manager Agent; PMA: Performance Manager Agent; IPoIB: IP over InfiniBand
InfiniBand in the Top500
Percentage share of InfiniBand is steadily increasing
[Pie chart: number of Top500 systems by interconnect family: Gigabit Ethernet 45%, InfiniBand 43%, Proprietary 6%, with Myrinet, Quadrics, Mixed, NUMAlink, SP Switch, Cray Interconnect, Fat Tree, and Custom making up the remainder]
InfiniBand System Efficiency in the Top500 List
[Plot: efficiency (%) versus Top500 system rank]
Large-scale InfiniBand Installations
• 214 IB clusters (42.8%) in the Nov '10 Top500 list (http://www.top500.org)
• Installations in the Top 30 (13 systems):
• HSE compute systems with ranking in the Nov 2010 Top500 list
– 8,856-core installation in Purdue with ConnectX-EN 10GigE (#126)
– 7,944-core installation …
• HSE is most popular in enterprise computing and other non-scientific markets, including wide-area networking
• Example Enterprise Computing …
• Introduction
• Why InfiniBand and High-speed Ethernet?
• Overview of IB, HSE, their Convergence and Features
• IB and HSE HW/SW Products and Installations …
Memcached Architecture
• Distributed caching layer
– Allows aggregating spare memory from multiple nodes
– General purpose
• Typically used to cache data …
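As a concrete illustration of the caching layer, the sketch below uses the libmemcached C client API to store a value on a memcached server and read it back; the server address, port, and key/value strings are illustrative placeholders, not values from the study in these slides.

```c
/* Minimal memcached set/get sketch using libmemcached (illustrative only).
 * Build (assuming libmemcached is installed): gcc memc_demo.c -lmemcached */
#include <libmemcached/memcached.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    memcached_return_t rc;
    memcached_st *memc = memcached_create(NULL);

    /* Hypothetical server pool: one memcached instance on the default port. */
    memcached_server_st *servers =
        memcached_server_list_append(NULL, "127.0.0.1", 11211, &rc);
    memcached_server_push(memc, servers);

    const char *key = "query:42", *val = "cached-result";
    rc = memcached_set(memc, key, strlen(key), val, strlen(val),
                       (time_t)0 /* no expiry */, (uint32_t)0 /* flags */);
    if (rc != MEMCACHED_SUCCESS)
        fprintf(stderr, "set failed: %s\n", memcached_strerror(memc, rc));

    size_t vlen; uint32_t flags;
    char *out = memcached_get(memc, key, strlen(key), &vlen, &flags, &rc);
    if (out) {
        printf("got %.*s\n", (int)vlen, out);
        free(out);
    }

    memcached_server_list_free(servers);
    memcached_free(memc);
    return 0;
}
```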
Modern Interconnects and Protocols
[Figure: application interfaces (Sockets, Verbs) and protocol implementations: kernel-space TCP/IP over an Ethernet driver, hardware-offloaded TCP/IP, and verbs-based hardware offload stacks]
• Low-level Network Performance
• Clusters with Message Passing Interface (MPI)
• Datacenters with Sockets Direct Protocol (SDP) and TCP/IP (IPoIB)
• InfiniBand …
[Charts: Low-level latency measurements: latency (us) versus message size (bytes) for Native IB, VPI-IB, VPI-Eth, and RoCE, small and large messages]
[Charts: Low-level uni-directional bandwidth measurements: bandwidth versus message size for Native IB, VPI-IB, VPI-Eth, and RoCE]
• High Performance MPI Library for IB and HSE
– MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2)
– Used by more than 1,550 organizations in 60 countries
– More than …
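Latency numbers like those on the following slides are typically gathered with a ping-pong microbenchmark. Below is a minimal, self-contained MPI sketch of that measurement pattern; the 4-byte message size and the iteration count are arbitrary choices, not the exact benchmark parameters used for these results.

```c
/* Ping-pong latency sketch between ranks 0 and 1 (run with: mpirun -np 2 ./a.out). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, iters = 10000, size = 4;      /* 4-byte messages, arbitrary count */
    char buf[4] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half the round-trip time */
        printf("one-way latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```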
[Charts: One-way latency, MPI over IB: small-message latencies of 1.54, 1.60, 1.96, and 2.17 us for the configurations shown; latency (us) versus message size (bytes)]
[Charts: Bandwidth, MPI over IB: peak unidirectional bandwidths ranging from about 1553 to 3023.7 MillionBytes/sec across the configurations shown, versus message size (bytes)]
[Charts: One-way latency, MPI over iWARP: Chelsio and Intel-NetEffect adapters, TCP/IP versus iWARP, latency (us) versus message size (bytes)]
[Charts: Bandwidth, MPI over iWARP: unidirectional bandwidths of 373.3, 839.8, 1169.7, and 1245.0 MillionBytes/sec for the Chelsio and Intel-NetEffect configurations, versus message size (bytes)]
• Good System Area Networks with excellent performance (low latency, high bandwidth, and low CPU utilization) for inter-processor communication (IPC) and I/O
[Charts: Convergent technologies, MPI latency: small- and large-message latency (us) versus message size (bytes)]
[Charts: Convergent technologies, MPI uni- and bi-directional bandwidth: Native IB, VPI-IB, VPI-Eth, and RoCE versus message size]
IPoIB vs. SDP Architectural Models
[Figure: traditional model: sockets application over the sockets API and kernel TCP/IP (IPoIB); possible SDP model: sockets application over the sockets API with SDP bypassing the kernel TCP/IP stack]
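Both models preserve the standard sockets API, so an unmodified TCP client such as the sketch below runs over IPoIB by simply connecting to the IB interface's IP address, and can run over SDP by transparently preloading the OFED libsdp library (e.g., LD_PRELOAD=libsdp.so). The address and port here are placeholders.

```c
/* Plain TCP client; runs unchanged over Ethernet, IPoIB, or (via libsdp
 * preload) SDP. Address and port are placeholders. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in srv = {0};
    srv.sin_family = AF_INET;
    srv.sin_port   = htons(5001);                      /* placeholder port    */
    inet_pton(AF_INET, "192.168.1.10", &srv.sin_addr); /* e.g., IPoIB address */

    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0) {
        perror("connect");
        return 1;
    }

    const char msg[] = "hello over IB";
    char reply[64];
    send(fd, msg, sizeof(msg), 0);
    ssize_t n = recv(fd, reply, sizeof(reply), 0);     /* wait for a response */
    if (n > 0)
        printf("received %zd bytes\n", n);

    close(fd);
    return 0;
}
```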
[Charts: SDP vs. IPoIB on IB QDR: bandwidth (MBps) and latency for IPoIB-RC, IPoIB-UD, and SDP across message sizes from a few bytes to 32K]
• Option 1: Layer-1 optical networks
– IB standard specifies link, network and transport layers
– Can use any layer-1 (though the standard says copper and …)
Features
• End-to-end guaranteed bandwidth channels
• Dynamic, in-advance, reservation and provisioning of fractional/full lambdas
• Secure control-plane …
• Supports SONET OC-192 or 10GE LAN-PHY/WAN-PHY
• Idea is to make remote storage “appear” local
• IB-WAN switch does frame conversion
– IB standard allows …
InfiniBand over SONET: Obsidian Longbows
[Figure: RDMA throughput measurements over USN: Linux hosts at ORNL connected via CDCIs in Chicago and Seattle, with a 700-mile segment shown]
• Hardware components
– Processing cores and memory subsystem
– I/O bus or links
– Network adapters/switches
• Software components
– Communication stack
• …
IB over 10GE LAN-PHY and WAN-PHY
[Figure: Linux hosts at ORNL and Seattle connected through CDCIs and Obsidian Longbow IB/S units over 700-, 3300-, and 4300-mile paths]
MPI over IB-WAN: Obsidian Routers
[Figure: Cluster A and Cluster B connected over a WAN link through Obsidian WAN routers]
Delay (us) / distance (km): 10 / 2; 100 / 20; 1000 / 200; 10000 / 2000
Communication Options in Grid
• Multiple options exist to perform data transfer on the Grid
• Globus-XIO framework currently does not support IB natively
• …
Globus-XIO Framework with ADTS Driver
[Figure: Globus XIO driver stack with data connection management, persistent session management, buffer & file management, and data transport layers]
Performance of Memory-Based Data Transfer
• Performance numbers obtained while transferring 128 GB of aggregate data in chunks of 256 MB files
• ADTS …
Performance of Disk-Based Data Transfer
• Performance numbers obtained while transferring 128 GB of aggregate data in chunks of 256 MB files
• Predict…
Application Level Performance
[Chart: bandwidth (MBps) for target applications CCSM and Ultra-Viz, ADTS versus IPoIB]
• Application performance for FTP get operations …
A New Approach towards OFA in Cloud
[Figure: current approach: application over accelerated sockets on 10 GigE or InfiniBand; towards OFA in Cloud: application over verbs / hardware offload]
Memcached Design Using Verbs
• Server and client perform a negotiation protocol
– Master thread assigns clients to appropriate worker thread
• Once a client …
• Ex: TCP/IP, UDP/IP
• Generic architecture for all networks
• Host processor handles almost all aspects of communication
– Data buffering (copies on send and receive)
Memcached Get Latency
• Memcached get latency
– 4 bytes: DDR 6 us; QDR 5 us
– 4K bytes: DDR 20 us; QDR 12 us
• Almost a factor of four improvement over …
Memcached Get TPS
• Memcached get transactions per second for 4 bytes
– On IB DDR: about 600K/s for 16 clients
– On IB QDR: 1.9M/s for 16 clients
• Almost …
Hadoop: Java Communication Benchmark
• Sockets-level ping-pong bandwidth test
• Java performance depends on usage of NIO (allocateDirect)
• C and Java versions …
Hadoop: DFS IO Write Performance
• DFS IO, included in Hadoop, measures sequential access throughput
• We have two map tasks, each writing to a file of …
Hadoop: RandomWriter Performance
• Each map generates 1 GB of random binary data and writes to HDFS
• SSD improves execution time by 50% with 1GigE for …
Hadoop Sort Benchmark
• Sort: baseline benchmark for Hadoop
• Sort phase: I/O bound; reduce phase: communication bound
• SSD improves performance by 28% …
• Presented network architectures and trends for Clusters, Grid, Multi-tier Datacenters and Cloud Computing Systems
• Presented background and details …
Funding Acknowledgments
• Funding support by … [sponsor logos]
• Equipment support by … [sponsor logos]
Personnel Acknowledgments
• Current Students: N. Dandapanthula (M.S.), R. Darbha (M.S.), V. Dhanraj (M.S.), J. Huang (Ph.D.), J. Jose (Ph.D.), …
• Traditionally relied on bus-based technologies (last mile bottleneck)
– E.g., PCI, PCI-X
– One bit per wire
– Performance increase through:
• Increasing …
Web Pointers
http://www.cse.ohio-state.edu/~panda
http://www.cse.ohio-state.edu/~surs
http://nowlab.cse.ohio-state.edu
MVAPICH Web Page: http…
• Network speeds saturated at around 1 Gbps
– Features provided were limited
– Commodity networks were not considered scalable enough for very large-scale …
• Industry Networking Standards
• InfiniBand and High-speed Ethernet were introduced into the market to address these bottlenecks
• InfiniBand aimed at …
• IB Trade Association was formed with seven industry leaders (Compaq, Dell, HP, IBM, Intel, Microsoft, and Sun)
• Goal: to design a scalable and high-performance …
• 10GE Alliance formed by several industry leaders to take the Ethernet family to the next speed step
• Goal: to achieve a scalable and high-performance …
Tackling Communication Bottlenecks with IB and HSE
• Network speed bottlenecks
• Protocol processing bottlenecks
• I/O interface bottlenecks
• Bit-serial differential signaling
– Independent pairs of wires to transmit independent data (called a lane)
– Scalable to any number of lanes
– Easy to …
Network Speed Acceleration with IB and HSE
Ethernet (1979 - ): 10 Mbit/sec
Fast Ethernet (1993 - ): 100 Mbit/sec
Gigabit Ethernet (1995 - ): 1000 Mbit/sec
…
[Chart: InfiniBand roadmap, 2005-2011, bandwidth per direction (Gbps): 32G and 48G IB-DDR, 48G and 96G IB-QDR, 112G IB-FDR, 200G and 300G IB-EDR, …]
• Intelligent Network Interface Cards
• Support entire protocol processing completely in hardware (hardware protocol offload engines)
• Provide a rich …
• Fast Messages (FM)
– Developed by UIUC
• Myricom GM
– Proprietary protocol stack from Myricom
• These network stacks set the trend for high-performance …
• Some IB models have multiple hardware accelerators
– E.g., Mellanox IB adapters
• Protocol Offload Engines
– Completely implement ISO/OSI layers 2-4 (link, network, and transport layers)
• Interrupt Coalescing
– Improves throughput, but degrades latency
• Jumbo Frames
– No latency impact; incompatible with existing switches
• Hardware Checksum …
Current and Next Generation Applications and Computing Systems
• Diverse Range of Applications
– Processing and dataset characteristics …
• TCP Offload Engines (TOE)
– Hardware acceleration for the entire TCP/IP stack
– Initially patented by Tehuti Networks
– Actually refers to the IC on the …
• Also known as “Datacenter Ethernet” or “Lossless Ethernet”
– Combines a number of optional Ethernet standards into one umbrella as mandatory requirements
• InfiniBand initially intended to replace I/O bus technologies with networking-like technology
– That is, bit-serial differential signaling
– With enhanced …
• Recent trends show that I/O interface speeds nearly match network speeds, though they still lag slightly
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
– Novel Features
– Subnet Management and Services
• High-speed …
Comparing InfiniBand with the Traditional Networking Stack
[Figure: IB stack with the application layer (MPI, PGAS, file systems) over the transport layer (OpenFabrics verbs; RC (reliable) and other services), shown alongside the traditional networking stack]
• InfiniBand
– Architecture and Basic Hardware Components
– Communication Model and Semantics
• Communication Model
• Memory registration and protection
• …
• Used by processing and I/O units to connect to the fabric
• Consume and generate IB packets
• Programmable DMA engines with protection features
• May have …
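As a minimal starting point with the verbs API, the sketch below enumerates the channel adapters visible on the host, opens the first one, and queries a few of its attributes; error handling is trimmed and no particular adapter is assumed.

```c
/* Enumerate and open an HCA with libibverbs (link with -libverbs). */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_device_attr attr;
    ibv_query_device(ctx, &attr);

    printf("device %s: %d port(s), max_qp=%d, max_mr=%d\n",
           ibv_get_device_name(devs[0]),
           attr.phys_port_cnt, attr.max_qp, attr.max_mr);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```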
Cluster Computing Environment
[Figure: compute nodes and a frontend on a LAN, with a meta-data manager and I/O server nodes serving metadata and data]
Components: Switches and Routers
• Relay packets from one link to another
• Switches: intra-subnet
• Routers: inter-subnet
• May support multicast
• Network Links
– Copper, optical, or printed circuit wiring on a backplane
– Not directly addressable
• Traditional adapters built for copper cabling
– Restricted …
IB Communication Model
[Figure: basic InfiniBand communication semantics]
• Each QP has two queues
– Send Queue (SQ)
– Receive Queue (RQ)
– Work requests are queued to the QP (WQEs: “wookies”)
• QP to be linked to a Completion Queue (CQ) …
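The sketch below shows how such a QP is created with its SQ and RQ tied to a completion queue, and how completions of posted work requests are later harvested with ibv_poll_cq. It assumes a device context and protection domain already exist; the queue depths are illustrative.

```c
/* Create a CQ and an RC QP whose SQ/RQ both report completions to that CQ.
 * Assumes ctx (ibv_context*) and pd (ibv_pd*) were set up earlier. */
#include <infiniband/verbs.h>

struct ibv_qp *make_qp(struct ibv_context *ctx, struct ibv_pd *pd,
                       struct ibv_cq **cq_out)
{
    /* One CQ shared by send and receive work completions. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 128 /* depth */, NULL, NULL, 0);

    struct ibv_qp_init_attr init = {0};
    init.send_cq = cq;                /* WQEs posted to the SQ complete here  */
    init.recv_cq = cq;                /* WQEs posted to the RQ complete here  */
    init.qp_type = IBV_QPT_RC;        /* reliable connection transport        */
    init.cap.max_send_wr  = 64;       /* illustrative queue depths            */
    init.cap.max_recv_wr  = 64;
    init.cap.max_send_sge = 1;
    init.cap.max_recv_sge = 1;

    *cq_out = cq;
    return ibv_create_qp(pd, &init);
}

/* Busy-poll until one work completion (CQE) is available. */
int wait_one_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);
    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```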
1. Registration request
• Send virtual address and length
2. Kernel handles the virtual-to-physical mapping and pins the region into physical memory
• Process …
• To send or receive data, the l_key must be provided to the HCA
• HCA verifies access to local memory
• For RDMA, the initiator must have the r_key for the remote memory region
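A minimal registration sketch, assuming a protection domain has already been allocated: the buffer is pinned and registered, and the resulting l_key/r_key are what local work requests and remote RDMA initiators must present. The access flags here deliberately allow remote read and write; a real application would grant only what it needs.

```c
/* Register a buffer and expose its keys (assumes pd was allocated earlier). */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = malloc(len);

    /* Kernel pins the pages and sets up the virtual-to-physical mapping. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE  |
                                   IBV_ACCESS_REMOTE_READ  |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        free(buf);
        return NULL;
    }

    /* lkey goes into local SGEs; rkey (plus the address) is handed to peers
     * that want to issue RDMA operations against this region.              */
    printf("registered %zu bytes: lkey=0x%x rkey=0x%x addr=%p\n",
           len, mr->lkey, mr->rkey, buf);
    return mr;   /* later: ibv_dereg_mr(mr); free(buf); */
}
```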
Communication in the Channel Semantics (Send/Receive Model)
[Figure: the receiver pre-posts a receive WQE describing a memory segment; the sender posts a send WQE describing its local memory segment; completions are reported through the CQs of the two QPs]
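In verbs code, the channel semantics reduce to the receiver pre-posting a receive WQE and the sender posting a send WQE, each pointing at a registered buffer through a scatter/gather entry. The sketch below shows both postings; it assumes an already-connected QP and registered MR, and error handling is trimmed.

```c
/* Channel semantics: post a receive on one side, a send on the other.
 * qp and mr are assumed to be an already-connected RC QP and a registered MR. */
#include <infiniband/verbs.h>
#include <stdint.h>

int post_recv(struct ibv_qp *qp, struct ibv_mr *mr, size_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)mr->addr, .length = (uint32_t)len, .lkey = mr->lkey
    };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad;
    return ibv_post_recv(qp, &wr, &bad);        /* consumed by a matching send */
}

int post_send(struct ibv_qp *qp, struct ibv_mr *mr, size_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)mr->addr, .length = (uint32_t)len, .lkey = mr->lkey
    };
    struct ibv_send_wr wr = {0}, *bad;
    wr.wr_id      = 2;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;                /* targets the peer's recv WQE  */
    wr.send_flags = IBV_SEND_SIGNALED;          /* generate a CQE on completion */
    return ibv_post_send(qp, &wr, &bad);
}
```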
Communication in the Memory Semantics (RDMA Model)
[Figure: the initiator's send WQE contains both the local and the remote memory segment information; no receive WQE is required at the target]
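For the memory semantics, the initiator's work request carries the local scatter/gather entry plus the target's virtual address and r_key (exchanged out-of-band beforehand); the target CPU posts nothing. A sketch of an RDMA Write posting under those assumptions:

```c
/* One-sided RDMA Write: local buffer -> remote buffer (no recv WQE at target).
 * remote_addr and rkey must have been exchanged out-of-band beforehand. */
#include <infiniband/verbs.h>
#include <stdint.h>

int post_rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, size_t len,
                    uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)mr->addr, .length = (uint32_t)len, .lkey = mr->lkey
    };
    struct ibv_send_wr wr = {0}, *bad;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* use IBV_WR_RDMA_READ to pull */
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;         /* target virtual address       */
    wr.wr.rdma.rkey        = rkey;                /* target memory region key     */
    return ibv_post_send(qp, &wr, &bad);
}
```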
Communication in the Memory Semantics (Atomics)
[Figure: the initiator's send WQE carries the local and remote memory information; the target HCA performs the atomic read-modify-write on the remote location]
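Atomics follow the same one-sided pattern but operate on a single 8-byte, aligned remote location, with the previous remote value returned into the local buffer. A sketch of a compare-and-swap posting; the remote address, r_key, and compare/swap values are parameters the caller is assumed to have obtained elsewhere:

```c
/* Remote atomic compare-and-swap on an 8-byte location; the previous remote
 * value is deposited into the (8-byte, registered) local buffer. */
#include <infiniband/verbs.h>
#include <stdint.h>

int post_atomic_cas(struct ibv_qp *qp, struct ibv_mr *mr,
                    uint64_t remote_addr, uint32_t rkey,
                    uint64_t expect, uint64_t newval)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)mr->addr, .length = sizeof(uint64_t), .lkey = mr->lkey
    };
    struct ibv_send_wr wr = {0}, *bad;
    wr.sg_list               = &sge;
    wr.num_sge               = 1;
    wr.opcode                = IBV_WR_ATOMIC_CMP_AND_SWP;
    wr.send_flags            = IBV_SEND_SIGNALED;
    wr.wr.atomic.remote_addr = remote_addr;   /* must be 8-byte aligned        */
    wr.wr.atomic.rkey        = rkey;
    wr.wr.atomic.compare_add = expect;        /* swap happens only if equal    */
    wr.wr.atomic.swap        = newval;
    return ibv_post_send(qp, &wr, &bad);
}
```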
Trends for Computing Clusters in the Top500 List (http://www.top500.org)
Nov. 1996: 0/500 (0%); Nov. 2001: 43/500 (8.6%); Nov. 2006: 361/500 (72.2%); …
Hardware Protocol Offload
[Figure: protocol stack; complete hardware implementations exist]
Link/Network Layer Capabilities
• Buffering and Flow Control
• Virtual Lanes, Service Levels and QoS
• Switching and Multicast
• IB provides three levels of communication throttling/control mechanisms
– Link-level flow control (link layer feature)
– Message-level flow control (transport layer feature)
– …
• Multiple virtual links within the same physical link
– Between 2 and 16
– Separate buffers and flow control
– Avoids head-of-line blocking
• VL15: reserved for management traffic …
• Service Level (SL):
– Packets may operate at one of 16 different SLs
– Meaning not defined by IB
• SL-to-VL mapping:
– SL determines which VL on the next link …
• InfiniBand virtual lanes allow the multiplexing of multiple independent logical traffic flows on the same physical link, providing the benefits of independent …
• Each port has one or more associated LIDs (Local Identifiers)
– Switches look up which port to forward a packet to based on its destination LID (DLID)
– …
• Basic unit of switching is a crossbar
– Current InfiniBand products use either 24-port (DDR) or 36-port (QDR) crossbars
• Switches available in the market …
• Someone has to set up the forwarding tables and give every port an LID
– The “Subnet Manager” does this work
• Different routing algorithms give different …
Grid Computing Environment
[Figure: compute clusters with frontends, meta-data managers, and I/O server nodes connected in a Grid]
• Similar to basic switching, except that the sender can utilize multiple LIDs associated with the same destination port
• Packets sent to one DLID take a fixed path …
[Figure: IB multicast example]
Hardware Protocol Offload
[Figure: protocol stack; complete hardware implementations exist]
IB Transport Services
• Each transport service can have zero or more QPs associated with it
– E.g., you can have four QPs based on RC and one QP based on UD
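In the verbs API the transport service is simply the qp_type selected at QP creation, so mixing services on one adapter just means creating several QPs. A small illustrative sketch (queue depths arbitrary; PD and CQ assumed to exist):

```c
/* Create QPs of different transport services over the same PD/CQ; the
 * service is selected purely by qp_type. */
#include <infiniband/verbs.h>

struct ibv_qp *create_qp_of_type(struct ibv_pd *pd, struct ibv_cq *cq,
                                 enum ibv_qp_type type)
{
    struct ibv_qp_init_attr init = {0};
    init.send_cq = cq;
    init.recv_cq = cq;
    init.qp_type = type;             /* IBV_QPT_RC, IBV_QPT_UC, or IBV_QPT_UD */
    init.cap.max_send_wr  = 32;
    init.cap.max_recv_wr  = 32;
    init.cap.max_send_sge = 1;
    init.cap.max_recv_sge = 1;
    return ibv_create_qp(pd, &init);
}

/* Usage: e.g., four RC QPs for connection-oriented traffic and one UD QP for
 * datagram-style traffic, all on the same adapter.                          */
```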
Trade-offs in Different Transport Types
[Table: attributes compared across Reliable Connection, Reliable Datagram, eXtended Reliable Connection, Unreliable Connection, and Unreliable Datagram]
Transport Layer Capabilities
• Data Segmentation
• Transaction Ordering
• Message-level Flow Control
• Static Rate Control and Auto-negotiation
• IB transport layer provides message-level communication granularity, not byte-level (unlike TCP)
• Application can hand over a large message
– Network …
• IB follows strong transaction ordering for RC
• Sender network adapter transmits messages in the order in which WQEs were posted
• Each QP utilizes …
• Also called end-to-end flow control
– Does not depend on the number of network hops
• Separate from link-level flow control
– Link-level flow control …
• IB allows link rates to be statically changed
– On a 4X link, data can be set to be sent at 1X
– For heterogeneous links, the rate can be set to the lowest …
Multi-Tier Datacenters and Enterprise Computing
[Figure: enterprise multi-tier datacenter with tier-1 routers/servers, application servers, and tier-3 database servers connected by switches]
• Agents
– Processes or hardware units running on each adapter, switch, and router (everything on the network)
– Provide the capability to query and set parameters
Subnet Manager
[Figure: compute nodes and switches with active and inactive links; multicast join and multicast setup messages handled by the subnet manager]
• High-speed Ethernet Family
– Internet Wide-Area RDMA Protocol (iWARP)
• Architecture and Components
• Features
– Out-of-order data placement
– Dynamic and …
IB and HSE RDMA Models: Commonalities and Differences
Hardware acceleration: supported in IB, supported in iWARP/HSE
RDMA: supported in IB, supported in iWARP/HSE
Atomic operations: …
• RDMA Protocol (RDMAP)
– Feature-rich interface
– Security Management
• Remote Direct Data Placement (RDDP)
– Data Placement and Delivery
– Multi-Stream …
• Place data as it arrives, whether in-order or out-of-order
• If data is out-of-order, place it at the appropriate offset
• Issues from the application's perspective …
• Part of the Ethernet standard, not iWARP
– Network vendors use a separate interface to support it
• Dynamic bandwidth allocation to flows based on …
Integrated High-End Computing Environments
[Figure: compute cluster with a meta-data manager and I/O server nodes (metadata and data) integrated with the broader environment]
• Can allow for simple prioritization:
– E.g., connection 1 performs better than connection 2
– 8 classes provided (a connection can be in any class)
• …
• Regular Ethernet adapters and TOEs are fully compatible
• Compatibility with iWARP required
• Software iWARP emulates the functionality of iWARP on the host …
Different iWARP Implementations
[Figure: regular Ethernet adapters (application, sockets, TCP/IP, device driver, network adapter) compared with offloaded and high-performance sockets stacks]
• Proprietary communication layer developed by Myricom for their Myrinet adapters
– Third-generation communication layer (after FM and GM)
– Supports Myrinet …
• Another proprietary communication layer developed by Myricom
– Compatible with regular UDP sockets (embraces and extends)
– Idea is to bypass the kernel …
Solarflare Communications: OpenOnload Stack
[Figure: typical HPC networking stack versus typical commodity networking stack]
• HPC networking stack provides …
• Single network firmware to support both IB and Ethernet
• Autosensing of layer-2 protocol
– Can be configured to automatically work with either IB or Ethernet …
Cloud Computing Environments
[Figure: physical machines hosting VMs on a LAN, with a virtual file system, meta-data servers, and I/O servers serving data]
• Native convergence of the IB network and transport layers with the Ethernet link layer
• IB packets encapsulated in Ethernet frames
• IB network layer already …
• Very similar to IB over Ethernet
– Often used interchangeably with IBoE
– Can be used to explicitly specify that the link layer is Converged (Enhanced) Ethernet
IB and HSE: Feature Comparison
Hardware acceleration: yes for IB, iWARP/HSE, RoE, and RoCE
RDMA: yes for IB, iWARP/HSE, RoE, and RoCE
Congestion control: yes for IB, optional for iWARP/HSE, …
• Many IB vendors: Mellanox+Voltaire and QLogic
– Aligned with many server vendors: Intel, IBM, SUN, Dell
– And many integrators: Appro, Advanced Clustering …
Tyan Thunder S2935 Board (Courtesy Tyan)
[Figure: board photo] Similar boards from Supermicro with LOM features are also available
• Customized adapters to work with IB switches
– Cray XD1 (formerly by Octigabay), Cray CX1
• Switches:
– 4X SDR and DDR (8-288 ports); 12X SDR (small …)
• 10GE adapters: Intel, Myricom, Mellanox (ConnectX)
• 10GE/iWARP adapters: Chelsio, NetEffect (now owned by Intel)
• 40GE adapters: Mellanox ConnectX2-…
• Mellanox ConnectX Adapter
• Supports IB and HSE convergence
• Ports can be configured to support IB or HSE
• Support for VPI and RoCE
– 8 Gbps (SDR), 16 Gbps (DDR), …
• Open-source organization (formerly OpenIB): www.openfabrics.org
• Incorporates both IB and iWARP in a unified manner
– Support for Linux and Windows
– …