Unveiling SmartX ECP 6.3 Upgrades in Availability: Expanding HA to SR-IOV, vGPU, and HCT-Enabled VMs

For many business-critical workloads, such as real-time trading, AI inference, and cryptography-related processing, enterprises attach hardware devices, including SR-IOV passthrough NICs, GPUs, and encryption cards, to the virtual machines (VMs) that run them.

However, because of the inherent limitations of device passthrough, high availability (HA) cannot be enabled for such VMs. If a physical host fails, recovery requires manual intervention, and downtime is typically measured in hours, which falls far short of production requirements.

SmartX ECP 6.3 addresses this limitation by extending HA support to VMs that use SR-IOV, vGPU, and HCT (Hygon Cryptographic Technology) devices. It introduces a device tagging feature that allows the system to recognize and reattach virtualized devices during recovery.

With this approach, workloads can be restarted automatically after a host failure without sacrificing the performance benefits of hardware devices. Recovery time is reduced from hours to minutes, making it possible to meet business continuity requirements in a wider range of scenarios.

A Common Challenge: The HA Gap for Device-Passthrough Workloads

In scenarios such as low-latency trading, AI inference, cryptography-related deployments, and high-performance computing, VMs often rely on hardware passthrough devices (such as SR-IOV NICs, GPUs, and HCT encryption cards) to achieve near-native performance. However, HA cannot be enabled for such VMs, which creates several production-level risks:

In the event of a physical host failure, VMs cannot be automatically rebuilt and require manual recovery.
Service downtime can range from several minutes to hours, which is unacceptable for core business operations.
High-performance devices are often restricted to testing environments, making it difficult to deploy and scale them in actual production.
Operations and maintenance (O&M) processes become complex, and failures are unpredictable, forcing enterprises to make a trade-off between performance and high availability.

SmartX ECP 6.3 Breakthrough: Enabling HA for VMs Using SR-IOV, vGPU, and HCT Based on Device Tagging

To address the challenge that high performance and HA are difficult to achieve at the same time, SmartX ECP 6.3 introduces HA capabilities for VMs using three types of hardware virtualization devices: SR-IOV, vGPU, and Hygon HCT. This allows VMs attached to such devices to be automatically rebuilt and quickly recovered in the event of a host failure, truly achieving “no compromise on performance with improved service reliability.” The core objectives of this feature include:

Enabling passthrough devices to evolve from “dedicated, static, and non-HA” to “pooled, schedulable, and HA-enabled.”
Allowing mission-critical workloads in finance and AI scenarios to achieve both extreme performance and high business continuity.
Unifying VM HA policies across SmartX ECP clusters to reduce O&M complexity and the risk of human error.

Technical Deep Dive: Core Feature, HA Workflow, and Product Comparison

Core Feature: Device Tagging

SmartX ECP 6.3 introduces the feature of device tagging to enable unified identification, scheduling, and rebuilding of hardware virtualization devices, which is the key to enabling HA for VMs using these devices.

SR-IOV NICs & HCT cards: Users can customize the device tag.
vGPU: The system automatically generates and matches device tags based on GPU model and partitioning configuration.
A VM automatically qualifies for HA when the cluster contains at least one available device (available count > 0) of the same type and with the same tag as the device attached to the VM.
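The eligibility rule above can be sketched in a few lines of Python. This is a minimal illustration, not SmartX's implementation: the `DevicePool` and `DeviceRequest` records, the tag names, and the choice to ignore devices on the failed host are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DevicePool:
    """Hypothetical record: devices of one type and tag on one host."""
    host: str
    device_type: str  # e.g. "sr-iov", "vgpu", "hct"
    tag: str
    available: int    # devices not currently attached to any VM


@dataclass(frozen=True)
class DeviceRequest:
    """Hypothetical record: the device a VM needs re-attached after HA."""
    device_type: str
    tag: str


def ha_eligible(request, pools, failed_host):
    """Return True if a surviving host has a free device of the same type and tag.

    Assumption: devices on the failed host do not count toward eligibility.
    """
    return any(
        pool.host != failed_host
        and pool.device_type == request.device_type
        and pool.tag == request.tag
        and pool.available > 0
        for pool in pools
    )


# Illustrative inventory; host names and tags are made up.
pools = [
    DevicePool("host-a", "sr-iov", "trading-net", 2),
    DevicePool("host-b", "sr-iov", "trading-net", 1),
    DevicePool("host-b", "hct", "sm-crypto", 0),
]

print(ha_eligible(DeviceRequest("sr-iov", "trading-net"), pools, "host-a"))  # True
print(ha_eligible(DeviceRequest("hct", "sm-crypto"), pools, "host-a"))       # False
```

The second request fails the check because the only matching HCT pool has an available count of 0, which mirrors the "available count > 0" condition.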

HA Triggering and Rebuilding Workflow

A physical host experiences an unexpected failure.
The system detects the failure and automatically triggers VM HA.
It then selects a target host within the cluster that has hardware virtualization devices of the same tag and of the same type.
The VM is automatically booted and rebuilt on the target host.
Hardware virtualization devices are automatically re-attached, enabling rapid service recovery.
The entire process requires no manual intervention, reducing downtime from hours to minutes.
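The workflow above can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the inventory layout, the function names `select_target_host` and `recover_vm`, the status strings, and the "first matching host in name order" selection are all hypothetical, and a real scheduler would also weigh CPU, memory, and placement policies.

```python
from collections import Counter


def select_target_host(vm_devices, inventory, failed_host):
    """Return the first surviving host that can satisfy every device request.

    vm_devices: list of (device_type, tag) pairs the VM had attached.
    inventory:  {host: {(device_type, tag): available_count}} (assumed shape).
    """
    needed = Counter(vm_devices)
    for host in sorted(inventory):  # assumption: plain name order, no load weighing
        if host == failed_host:
            continue
        free = inventory[host]
        if all(free.get(key, 0) >= count for key, count in needed.items()):
            return host
    return None


def recover_vm(vm_name, vm_devices, inventory, failed_host):
    """Pick a target host, reserve matching devices, and report the outcome."""
    target = select_target_host(vm_devices, inventory, failed_host)
    if target is None:
        return {"vm": vm_name, "status": "rebuild_failed", "host": None}
    for key in vm_devices:
        inventory[target][key] -= 1  # re-attach: consume one matching device
    return {"vm": vm_name, "status": "rebuilt", "host": target}


# Illustrative cluster: host-a fails, host-b has no free device, host-c has one.
inventory = {
    "host-a": {("sr-iov", "trading-net"): 2},
    "host-b": {("sr-iov", "trading-net"): 0},
    "host-c": {("sr-iov", "trading-net"): 1},
}
result = recover_vm("trading-vm-01", [("sr-iov", "trading-net")], inventory, "host-a")
print(result)
```

Matching on the `(device_type, tag)` pair rather than on a specific physical device is what lets the VM restart on any host that offers an equivalent device.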

Key Highlights

Does not rely on specific hardware and requires no changes to VM configurations.
Fully integrates with existing VM HA frameworks, providing unified policies.
Supports hybrid deployment, enabling unified management of both standard VMs and VMs using SR-IOV, vGPU, or HCT devices.
Automates failure detection, VM rebuilding, and business recovery.

Product Comparison: SmartX vs. VMware vs. Nutanix

Capability	SmartX ECP 6.3	VMware/Nutanix and Other HCI Products
SR-IOV NIC VM HA	✅ Supported	❌ Not Supported
vGPU VM HA	✅ Supported	✅ Supported
Hygon HCT encryption card VM HA	✅ Supported (exclusive)	❌ Not Supported
Automatic Failure Rebuild	✅ Supported	✅ Supported
Core Scenario Coverage	Low-Latency Trading & AI	AI & General Scenarios

Overall Advantages

Industry-Exclusive Support: The first in the industry to support HA for VMs using SR-IOV, vGPU, and HCT devices, covering core financial trading and AI scenarios.
High Performance with HA Protection: Maintains passthrough device performance while providing automatic failure recovery capabilities.
Reduced Business Interruption Risk: Shifts from manual recovery (hours) to automatic rebuild (minutes).
Simplified O&M and Lower Complexity: VMs attached to SR-IOV passthrough NICs or vGPUs use the same HA, alerting, and monitoring framework as standard VMs.
Enabling Scalable Deployment of Core Workloads: Ensures production-level HA for low-latency trading and AI inference workloads.
Dual Assurance of Compliance and Stability: In cryptography transformation scenarios, the platform meets encryption compliance requirements while also protecting business continuity.

Use Cases

A leading securities company adopts SmartX ECP with HCT encryption cards for national cryptography transformation, replacing hardware encryption appliances.

Background

During the cryptography transformation of its core business systems (online trading, account management, certificate authentication, and data encryption), the securities company relied heavily on external hardware encryption appliances, which led to a series of persistent challenges:

The procurement cost of hardware encryption appliances is extremely high. Individual devices are expensive, and capacity expansion requires repeated investment.
External encryption cards/appliances use PCIe attachment, occupying slots and increasing cabling and thermal management pressure.
Resource utilization is extremely low. A single encryption appliance can only serve a small number of systems and cannot be shared through virtualization.
O&M is complex. Encryption appliances must be managed, inspected, and maintained separately, and failure recovery depends on manual intervention.
Appliances cannot be integrated with virtualization platforms. VMs running encryption workloads do not support HA, resulting in a single point of failure risk.

To meet the requirements of ITAI and national cryptographic compliance while also reducing costs, simplifying the architecture, and improving HA, the customer needed a new solution that combines chip-level built-in encryption with virtualization-based HA.

Solution and Implementation

Ultimately, the securities company built a new platform based on SmartX ECP 6.3 and Hygon HCT technology, combining “chip-level built-in encryption + HA for passthrough devices”:

The system leverages the built-in HCT cryptographic co-processor in Hygon CPUs to replace traditional external hardware encryption appliances, enabling instruction-level acceleration for SM2/SM3/SM4 algorithms.
VMs access chip-level encryption capabilities through HCT passthrough, achieving near–bare-metal performance with significantly lower latency compared to external encryption cards.
SmartX ECP 6.3 enables HA for VMs using HCT passthrough devices, allowing critical encryption workloads to be automatically rebuilt in the event of a failure.
All resources are unified into the SMTX OS cluster and centrally managed via CloudTower, eliminating the need for a separate encryption appliance management system.

Key Values

1. From “external encryption appliances” to “built-in encryption”
By leveraging the Hygon CPU’s integrated cryptographic module, key storage and cryptographic operations are performed locally on the CPU, without reliance on external devices, resulting in a smaller attack surface and stronger security.

2. Significant cost reduction, with overall investment reduced by approximately 40%–60%
Eliminates the procurement and maintenance costs of encryption appliances and cards, while reusing existing ITAI server compute resources. A single server can effectively replace multiple encryption appliances, improving resource utilization by 3–5 times.

3. HA support for VMs using HCT passthrough devices, eliminating single points of failure for encryption workloads
In the event of a host failure, encryption workload VMs are automatically rebuilt on other nodes within the same cluster, reducing recovery time from hours to minutes and meeting 24×7 compliance requirements.

4. Superior performance compared to traditional encryption cards
Chip-level instruction acceleration significantly improves performance for SSL encryption/decryption and certificate signing/verification, eliminating performance bottlenecks in high-concurrency transaction scenarios.

A futures company uses SR-IOV passthrough to support low-latency trading, achieving near–bare-metal performance

Background

The company’s core trading systems (including ultra-low-latency trading terminals, market data gateways, and order routing gateways) are extremely sensitive to latency:

Traditional virtualized networking (virtio) introduces high latency and jitter, making it unable to meet microsecond-level low-latency requirements.
Solarflare low-latency NICs combined with SR-IOV passthrough are required to achieve production-grade trading performance.
In the legacy architecture, VMs using SR-IOV could not enable HA. In the event of a host failure, recovery had to be performed manually, resulting in a high risk of service interruption.
A large number of physical servers were deployed in a distributed manner, leading to tight rack space, high power consumption, complex O&M, and poor resource reuse.

Against this backdrop, the futures company raised a core requirement: achieve both ultra-low latency and HA, combining the convenience of virtualization with near-physical-machine performance.

Solution and Implementation

Based on the SmartX ECP 6.3 low-latency solution, the futures company built a next-generation infrastructure for core trading:

SR-IOV NIC passthrough: VMs bypass the virtual switch and access physical NICs directly, achieving latency close to bare-metal systems.
Dedicated CPU allocation + NUMA affinity binding: Eliminates scheduling overhead and further reduces latency jitter.
RDMA-based storage network: Minimizes inter-node I/O latency without impacting the trading network.
SR-IOV VM HA: In the event of a host failure, VMs are automatically rebuilt while maintaining MAC/IP consistency, enabling seamless service recovery.

Learn more: SmartX ECP Supports Low-Latency Securities Trading: Achieving Bare-Metal-Level Latency

Key Values and Highlights

1. Network latency reaches near–bare-metal levels
TCP 64B latency is as low as 1.542 μs, and UDP 64B latency is 1.449 μs, fully meeting the requirements of ultra-low-latency futures trading and market data systems.

2. SR-IOV VM HA validation fully passed
HA was triggered by simulating a host failure, and the VM was rebuilt within 25 seconds:
· MAC addresses remained unchanged and IP connectivity was preserved.
· Low-latency performance metrics showed no significant deviation compared to the pre-HA state.
· The results fully met regulatory transparency requirements.

3. Significantly improved resource utilization
A single physical server can be virtualized into multiple low-latency trading instances, reducing rack space usage, lowering power consumption, and decreasing hardware investment.

4. Zero packet loss in market data receiving
Under high-volume market data replay stress testing, no packet loss or latency spikes were observed, with performance comparable to bare-metal systems.
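The SR-IOV VM HA validation described in item 2 can be expressed as a small set of pass/fail checks. The Python sketch below is hypothetical: the `VmSnapshot` fields, the 5% latency tolerance, and the sample values are assumptions for illustration, and the 25-second limit simply reuses the rebuild time reported above as a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSnapshot:
    """Hypothetical snapshot of a VM's observable state before or after HA."""
    mac: str
    ip_reachable: bool
    latency_us: float  # measured round-trip latency in microseconds


def validate_ha(before, after, rebuild_seconds,
                max_rebuild_seconds=25.0, latency_tolerance=0.05):
    """Return a dict of named checks; every value should be True for a pass.

    Assumptions: the rebuild limit and the relative latency tolerance are
    example thresholds, not values defined by the product.
    """
    allowed_drift = latency_tolerance * before.latency_us
    return {
        "mac_unchanged": before.mac == after.mac,
        "ip_reachable": after.ip_reachable,
        "rebuilt_in_time": rebuild_seconds <= max_rebuild_seconds,
        "latency_stable": abs(after.latency_us - before.latency_us) <= allowed_drift,
    }


# Placeholder measurements, not results from the case study.
before = VmSnapshot(mac="52:54:00:12:34:56", ip_reachable=True, latency_us=1.50)
after = VmSnapshot(mac="52:54:00:12:34:56", ip_reachable=True, latency_us=1.53)

checks = validate_ha(before, after, rebuild_seconds=20.0)
print(checks)
print("HA validation passed:", all(checks.values()))
```

Keeping each criterion as a named check makes it easy to see which condition failed when a drill does not pass.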

Scenario Values

Scenario	Core Capabilities of SmartX ECP	Industry Challenges	Core Values
National cryptography transformation for the securities industry	Hygon HCT passthrough + VM HA for passthrough devices	Traditional encryption appliances are expensive, difficult to maintain, and lack HA protection	One-stop delivery of compliance, cost reduction, and HA enhancement: No need for traditional encryption appliances while still meeting regulatory requirements (e.g., MLPS / cryptographic compliance). No additional investment required, while delivering stronger performance and security. No complex operations required, while still achieving HA for encryption workloads.
Low-latency trading for the futures industry	SR-IOV passthrough + VM HA for passthrough devices + RDMA	High latency in virtualization environments and lack of HA protection	No compromise on low latency: SR-IOV passthrough delivers performance comparable to high-frequency trading bare-metal systems. No compromise on HA: SR-IOV-based VMs support HA, eliminating single points of failure. Significant cost and efficiency gains: reduced physical server footprint, unified resource pool, and simplified operations.

Conclusion: Achieving Comprehensive VM HA with SmartX ECP 6.3

Beyond HA for VMs using hardware passthrough devices, SmartX ECP 6.3 further introduces new HA features such as RDMA cross-NIC HA, placement group-based availability zone policies, and end-to-end HA alerting. Together, these enhancements strengthen HA capabilities across four dimensions—device, network, scheduling, and operations—building a comprehensive protection system for core business workloads.

Device-Level HA: Provides HA support for VMs using SR-IOV passthrough NICs, vGPUs, and HCT devices, enabling high-performance workloads to benefit from hardware acceleration while also gaining automated failure recovery.
Network-Level HA: Supports RDMA multi-link cross-NIC bonding, elevating storage network redundancy from the traditional port level to the NIC level. This significantly improves the reliability of high-performance networks while maintaining RDMA’s low-latency, high-throughput characteristics.
Scheduling-Level HA: Placement group rules now include availability zone policies, allowing VMs to be bound to primary or secondary availability zones. This ensures stable operation of workloads under active-active architectures. Regardless of cluster expansion, host replacement, or failure scheduling, VMs remain within the designated zone, preventing a single-zone failure from causing widespread business interruption and enhancing reliability and operability in active-active deployments.
O&M-Level HA: Introduces end-to-end HA alerting, covering key scenarios such as VM HA rebuild success, VM HA rebuild failure, local rebuild failure, and network fault–triggered HA. With clear alerts and event logs, O&M teams can monitor HA execution in real time, quickly identify anomalies, and respond promptly to failures, making HA truly observable, perceivable, and guaranteed.
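The availability zone policy under Scheduling-Level HA amounts to a filter on HA rebuild candidates: only surviving hosts in the VM's designated zone are considered. The sketch below is a simplified illustration, and the function name `hosts_in_zone`, the zone labels, and the host-to-zone mapping are assumptions rather than product APIs.

```python
def hosts_in_zone(vm_zone, host_zones, failed_hosts=()):
    """Return surviving hosts that belong to the VM's designated zone.

    host_zones: {host: zone_label} (assumed shape for this example).
    """
    failed = set(failed_hosts)
    return sorted(
        host for host, zone in host_zones.items()
        if zone == vm_zone and host not in failed
    )


# Illustrative active-active layout with two zones.
host_zones = {
    "host-a": "primary",
    "host-b": "primary",
    "host-c": "secondary",
}

# host-a fails; a VM bound to the primary zone may only restart on host-b.
print(hosts_in_zone("primary", host_zones, failed_hosts=["host-a"]))  # ['host-b']
```

Applying the zone filter before any other placement decision is one way to keep a VM inside its zone during expansion, host replacement, and failure scheduling alike.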

Leveraging existing standard VM HA capabilities, placement group functionality, and HA priority settings, SmartX ECP can provide more comprehensive HA protection for mission-critical business systems in industries such as finance, healthcare, and manufacturing, helping enterprise users build a stable, efficient, and deployable foundation.

Learn more about the upgraded features in SmartX ECP 6.3 from our latest blogs:

SmartX ECP 6.3 Released: Leading the New Standard for Critical Business Support in HCI

Unveiling SmartX ECP 6.3 Upgrades in DR Capabilities: Native Synchronous Replication and CloudTower HA

Unveiling SmartX ECP 6.3 Upgrades in Availability: Enhancing High-Performance Workload Reliability with Cross-NIC RDMA HA

SmartX and AsiaInfo Security Jointly Launch Agentless Security Protection Solution for Enterprise Cloud Platform

SmartX ECP Supports Low-Latency Securities Trading: Achieving Bare-Metal-Level Latency