Latency-sensitive applications—ranging from trading platforms to core database systems—depend heavily on RDMA to deliver consistently low response times. Yet when it comes to reliability, most existing designs fall short: redundancy is typically limited to individual ports, leaving the entire RDMA path vulnerable to a single NIC failure and potentially causing service interruptions in critical environments.
To overcome this, SmartX ECP 6.3 adds support for cross-NIC, multi-link RDMA bonding. Built on OVS bonding, it allows multiple links across different physical NICs to be aggregated, upgrading storage network redundancy from port-level to NIC-level. While preserving RDMA’s microsecond-level latency advantage, it enables failover at the NIC level, ensuring continuous service in case of hardware faults.
As a result, RDMA storage networks can achieve reliability comparable to Fibre Channel (FC), meeting the demands of core workloads for both performance and stability.
Why Is RDMA Multi-Link Bonding Necessary?
RDMA (Remote Direct Memory Access) is a high-performance remote memory access technology designed for latency-sensitive workloads. By bypassing the operating system kernel and network protocol stack, RDMA enables direct data transfer between nodes, delivering three key advantages: microsecond-level latency, high throughput, and low CPU overhead.
It is particularly well-suited for scenarios such as low-latency financial trading, mission-critical databases, high-performance computing (HPC), and AI training. In these scenarios, traditional TCP/IP networks often fail to meet microsecond-level response requirements. RDMA, by contrast, significantly improves I/O efficiency and reduces application latency, making it a foundational technology for modern high-performance architectures.
However, despite its performance advantages, traditional approaches to RDMA HA expose several critical limitations:
- Limited to Linux Bond: Only multiple ports on the same physical NIC can be bonded, with no support for cross-NIC bonding.
- A physical NIC failure can directly break the RDMA link: Linux Bond cannot handle NIC-level failures, which may cause storage I/O interruption, business stalls, or even service outages.
- Unable to meet the HA requirements of mission-critical workloads: Core scenarios such as low-latency trading require NIC-level and switch-level HA, which traditional solutions fail to provide.
To address these limitations, cross-NIC RDMA bonding has been introduced as a key enhancement in SmartX ECP 6.3.
SmartX ECP 6.3: Achieving Full High Availability for RDMA Networks
SmartX ECP 6.3 redesigns the RDMA network architecture by replacing Linux Bond with OVS Bond, while supporting multi-port bonding across different physical NICs.
This means that ports from two different physical NICs can now be added to the same bond. As a result, a single NIC failure or a single switch failure will no longer interrupt business traffic.
This delivers true HA against NIC-level failures. Existing clusters can also gain the feature by upgrading, helping mission-critical workloads achieve both high performance and stability.
Technical Deep Dive: Feature Evolution and Implementation Mechanisms
SmartX has continuously evolved its approach to RDMA network HA through multiple stages of technical innovation:
Earlier Releases: Linux Bonding + OpenFabrics Approach
In earlier HCI RDMA network designs, NIC hardware bonding was commonly implemented according to the OpenFabrics Alliance standard. This approach relies on NIC hardware capabilities to distribute traffic across multiple ports within the same physical NIC. Its core assumption is that the RDMA Queue Pair (QP) state can be shared across multiple physical ports on the same NIC. However, this architecture has several limitations:
- Only supports bonding within the same NIC: It cannot aggregate ports across different physical NICs, so a NIC-level failure will directly interrupt services.
- Difficult to implement in active-active / dual-switch architectures: In LACP scenarios, if a QP sends packets through Port A but receives through Port B, Port B cannot recognize the QP state, and the packets will be dropped directly.
- Incomplete HA capability: It cannot meet the redundancy requirements of dual-NIC and dual-switch architectures required by financial services and other mission-critical workloads.
The root cause is that QP state is tightly bound to physical ports and cannot be shared across devices, which fundamentally limits the HA capability of RDMA networks.
SmartX ECP 6.3: A New OVS + RDMA Software-Coordinated Bonding Architecture
To overcome hardware limitations, SmartX ECP 6.3 introduces an innovative approach that combines OVS bonding with RDMA software-level multipathing:
- Unified Abstraction at the Network Layer: OVS Bonding as the Foundation
With OVS Bonding, multiple physical ports are aggregated into a single logical port, providing a unified storage IP externally and remaining fully transparent to upper-layer applications. This design provides several key benefits:
- Supports bonding across different physical NICs, fundamentally breaking hardware constraints.
- Remains fully compatible with traditional TCP-based applications, such as ZooKeeper clusters.
- Natively supports load balancing and failover, providing complete HA capabilities.
- RDMA Software-Layer Innovation: Multipathing Solves the Core Challenge
At the RDMA layer, SmartX ECP 6.3 introduces a multipathing mechanism to fundamentally solve the QP-bonding limitations of hardware-based approaches.
- Multiple QPs: Establishes multiple QPs for the same logical connection, with each QP bound to a different physical port.
- L4 Path Detection: Uses a Layer 4 path detection mechanism to identify the actual forwarding path of each five-tuple flow, and selects stable paths with consistent send/receive behavior.
- Fast Reconnection Within Seconds: When one path becomes unavailable, the application layer quickly switches to a standby QP, so failover completes without the service noticing and without packet loss.
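The multipathing idea above can be illustrated with a small simulation. This is only a sketch under stated assumptions: `QueuePair` and `MultipathConnection` are hypothetical stand-ins, not SmartX or RDMA verbs APIs, and a real implementation would drive actual QPs and learn path health from a detection mechanism rather than a boolean flag.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QueuePair:
    """Hypothetical stand-in for one RDMA QP pinned to a single physical port."""
    port: str
    healthy: bool = True


@dataclass
class MultipathConnection:
    """One logical connection backed by several QPs, one per physical port."""
    qps: List[QueuePair]
    active: int = 0

    def failover(self) -> None:
        """Make the first healthy QP active; raise if every path is down."""
        for index, qp in enumerate(self.qps):
            if qp.healthy:
                self.active = index
                return
        raise ConnectionError("no healthy RDMA path left")

    def send(self, payload: bytes) -> str:
        """Pretend to send payload and return the name of the port used."""
        if not self.qps[self.active].healthy:
            self.failover()
        return self.qps[self.active].port


# Two QPs on ports of two different NICs (port names are made up).
conn = MultipathConnection([QueuePair("nic0-p0"), QueuePair("nic1-p0")])
first = conn.send(b"io")
conn.qps[0].healthy = False  # simulate a failure of the first NIC
second = conn.send(b"io")
```

Keeping the standby QP established in advance is what lets the switch happen without rebuilding the logical connection.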
At the same time, this approach addresses the three historical limitations of OVS bonding described below, balancing cross-NIC aggregation, high stability, and high performance. This makes it well-suited for demanding scenarios such as financial core systems and active-active deployments.
MAC Flapping
- Issue: MAC flapping in RDMA networks typically occurs when the same MAC address is incorrectly learned on multiple switch ports due to host or network configuration issues (such as bonding, LAG/MLAG inconsistency, or duplicate MAC assignment). This can result in packet loss and degraded RDMA performance.
- Solutions:
- Periodically retrieve the Active Slave of the OVS Bond.
- Force RDMA connections to be established only through the active port.
- Keep traffic paths consistent to eliminate MAC flapping at the source.
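The first step, retrieving the active port of the bond, might look like the sketch below. It parses text modeled on the output of `ovs-appctl bond/show <bond>`. The sample layout is an assumption: the exact format differs between Open vSwitch releases (older ones say "slave", newer ones "member"), and how SmartX queries the bond internally is not described in this article.

```python
import re
from typing import Optional

# Sample text modeled on `ovs-appctl bond/show bond0`. The layout is an
# assumption; check the output of your own Open vSwitch version.
SAMPLE = """\
---- bond0 ----
bond_mode: active-backup
lacp_status: off

slave eth0: enabled
  may_enable: true

slave eth2: enabled
  active slave
  may_enable: true
"""

# Matches per-interface header lines such as "slave eth0: enabled".
_HEADER = re.compile(r"^(?:slave|member) (\S+): (\w+)")


def active_member(bond_show: str) -> Optional[str]:
    """Return the interface marked active in the given text, or None."""
    current = None
    for line in bond_show.splitlines():
        header = _HEADER.match(line)
        if header:
            current = header.group(1)
        elif current and line.strip() in ("active slave", "active member"):
            return current
    return None


active = active_member(SAMPLE)
```

A caller could run this on a timer and re-establish RDMA connections on the returned interface whenever the value changes.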
Long and Unpredictable Failover Reconnection Time
- Issue: During failover, network disconnection and slow reconnection may affect replica safety and I/O latency.
- Solutions:
- Use an L3/L4-like detection mechanism to quickly identify available ports.
- Optimize connection rebuilding and QP reconnection logic.
- Significantly reduce failover latency and improve overall stability.
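As an illustration of fast, bounded detection, the sketch below uses a TCP handshake with a short timeout as the probe. This is an assumption made for the example: the article does not say which packets the product's detection mechanism sends, and `path_is_up` and `first_usable` are hypothetical helper names.

```python
import socket
from typing import Iterable, Optional, Tuple

Address = Tuple[str, int]


def path_is_up(dst: Address, source: Optional[Address] = None,
               timeout: float = 0.2) -> bool:
    """Return True if a TCP handshake to dst completes within timeout.

    source optionally pins the local (ip, port), so that a specific
    five-tuple is probed.
    """
    try:
        with socket.create_connection(dst, timeout=timeout,
                                      source_address=source):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False


def first_usable(dst: Address, sources: Iterable[Address],
                 timeout: float = 0.2) -> Optional[Address]:
    """Return the first local address whose probe to dst succeeds."""
    for source in sources:
        if path_is_up(dst, source, timeout):
            return source
    return None
```

The short timeout bounds the worst-case detection time per candidate, which is what keeps failover latency predictable.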
Unstable Connections in balance-tcp Mode
- Issue: In Port Channel scenarios, inconsistent switch hashing may cause return traffic to arrive on a different port than the one used to establish the RDMA connection, resulting in packet drops.
- Solutions:
- First, send probe packets to determine the available ports on both ends.
- Then, establish the RDMA connection and bind it to the selected port.
- Ensure send/receive path consistency so the connection remains stable and reliable.
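The probe-then-bind steps above can be modeled with a toy simulation. Everything here is illustrative: the two hash functions are placeholders for the host and switch hashing (which in reality differ by vendor and configuration), the addresses and ports are made up, and in a real deployment the receive port would be observed from probe replies rather than computed.

```python
import zlib
from typing import Iterable, Optional, Tuple

Flow = Tuple[str, int, str, int]  # (src_ip, src_port, dst_ip, dst_port)


def pick_port(flow: Flow, salt: bytes, n_ports: int) -> int:
    """Toy hash that maps a flow to one of n_ports bond members."""
    return zlib.crc32(salt + repr(flow).encode()) % n_ports


def find_consistent_flow(src_ip: str, dst_ip: str, dst_port: int,
                         candidates: Iterable[int],
                         n_ports: int = 2) -> Optional[Tuple[Flow, int]]:
    """Return the first flow whose send and receive ports match, if any."""
    for src_port in candidates:
        flow = (src_ip, src_port, dst_ip, dst_port)
        tx = pick_port(flow, b"host", n_ports)    # port the host would send on
        rx = pick_port(flow, b"switch", n_ports)  # port replies would arrive on
        if tx == rx:
            return flow, tx
    return None


# Hypothetical addresses and port range, chosen only for the example.
result = find_consistent_flow("10.0.0.1", "10.0.0.2", 7000,
                              range(50000, 50064))
```

Only a flow that passes this check would then be used to establish the RDMA connection, so the QP always sees its return traffic on the port it was bound to.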
Solution Comparison: Linux Bond vs. OVS Bond
| Dimension | Linux Bond | OVS Bond |
| --- | --- | --- |
| Cross-NIC Bonding | Not supported (only multiple ports on the same NIC) | Supported (true NIC-level HA) |
| HA Coverage | Port-level failures only | Port-level + NIC-level + switch-level failures |
| Stability | High, based on the standard implementation | High (all historical issues have been fixed in SmartX ECP 6.3) |
| Failover Perception | No service impact, no packet loss | Short failover time, acceptable to business workloads |
| Supported Bonding Modes | active-backup / balance-xor / 802.3ad | active-backup / balance-tcp |
| Unsupported Modes | None | balance-slb |
| Applicable Scenarios | General architectures without HA requirements | Financial core systems, active-active architectures, and high-performance / high-reliability scenarios |
| Upgrade for Existing Clusters | Used natively | Supports one-click migration through network-tool |
Feature Highlights
- True NIC-Level HA: Supports bonding RDMA ports across different physical NICs, ensuring that a single NIC failure does not interrupt business traffic.
- Dual-Switch HA as a Standard Architecture: Meets the stringent reliability requirements of financial services, trading systems, and mission-critical databases.
- Fully Resolves Long-Standing Technical Issues: Fixes issues such as MAC flapping, unpredictable reconnection, and unstable balance-tcp behavior.
- Supports Two Production-Ready Bonding Modes:
  a. active-backup: primary/standby HA
  b. balance-tcp: load balancing with aggregated bandwidth across multiple ports
- Automatic Multipathing for Data Channels: In balance-tcp mode, multipathing is enabled automatically, fully utilizing multi-NIC bandwidth and improving cluster write performance.
- Smooth Upgrade for Existing Clusters: Existing clusters using Linux Bond can be switched via network-tool in one click, without requiring architectural changes.
- Unified Network Architecture: Storage networks are unified under OVS Bond, simplifying management and improving operational efficiency.
Conclusion
SmartX ECP 6.3 addresses the long-standing industry challenge of “insufficient HA for RDMA networks” through cross-NIC RDMA bonding:
- High performance without compromise: preserves RDMA’s microsecond-level low-latency advantage.
- Comprehensive HA enhancement: supports protection against NIC-level and switch-level failures.
- Ready for mission-critical workloads: reliable and trusted for core business scenarios.
- Financial-grade reliability without FC costs: delivers high reliability without requiring expensive FC networks.
- Practical dual-NIC and dual-switch HA architectures: provides reliable protection for core business systems.
Learn more about the upgraded features in SmartX ECP 6.3 from our latest blogs:
SmartX ECP 6.3 Released: Leading the New Standard for Critical Business Support in HCI