March 21st, 2024 – Modern IT infrastructure innovator SmartX releases SMTX Kubernetes Service (SKS) 1.2, a production-ready container management and service product. Updates include enhanced support for high-performance computing scenarios such as AI, support for multiple CPU architectures, and optimized management and use of container images. With these updates, SKS 1.2 makes it easier for users to run AI workloads in containers on easily created Kubernetes clusters.
Background
SMTX Kubernetes Service (SKS) is based on SmartX HCI and integrates industry-leading SmartX product components for virtualization, distributed storage, networking, and security, among others. It aims to help IT operations teams easily deploy and manage production-ready Kubernetes clusters on servers with multiple CPU architectures.
SKS is CNCF-certified and supports multiple add-ons in the CNCF ecosystem. It is easy to use, production-ready, and free of vendor lock-in.
Customers across multiple industries have deployed SKS in production to build Kubernetes clusters and achieve unified management of their virtualization and container environments.
SKS 1.2 now supports GPUs to address a wider range of scenarios, especially growing AIGC (AI-generated content) demands. It also supports the AArch64 CPU architecture and the openEuler OS to meet diverse customer needs.
Updates
AI & Other High-Performance Computing Scenarios
SKS 1.2 provides comprehensive support for various NVIDIA GPU models, allowing different GPU models to be used flexibly within a single cluster. By leveraging the GPU resources provided by the SmartX Hyper-converged Cluster, users can give Kubernetes workloads efficient parallel computing capabilities and maximize GPU utilization through a range of features.
SKS 1.2 supports not only direct assignment of physical GPUs to workloads, but also the following shared GPU modes (see the scheduling sketch after this list):
- Virtual GPU (vGPU): A single physical GPU can be split into multiple vGPUs to be shared by different nodes.
- Time-Slicing: A GPU or vGPU can allow multiple processes to use it concurrently through a time-sharing mechanism.
- Multi-Instance GPU (MIG): Supports splitting a single physical GPU into multiple independent instances to ensure resource isolation between instances.
- Multi-Process Service (MPS): Allows multiple processes to share the computing resources of a single GPU, effectively reducing the performance loss caused by process switching.
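As a concrete illustration of how a workload consumes GPU resources under these modes, here is a minimal sketch using the official Kubernetes Python client to request a single GPU (or a single time-sliced/MIG share) for a pod. The resource names (such as nvidia.com/gpu or nvidia.com/mig-1g.5gb) and the container image are assumptions that depend on how the NVIDIA device plugin is configured in the cluster; this shows the generic Kubernetes pattern, not an SKS-specific API.
```python
# Minimal sketch (not SKS-specific): request one GPU, or one time-sliced /
# MIG share, for a pod via the official Kubernetes Python client.
from kubernetes import client, config

config.load_kube_config()  # assumes a kubeconfig for the workload cluster
core = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="gpu-smoke-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[
            client.V1Container(
                name="cuda",
                # Hypothetical image; any CUDA-enabled image works here.
                image="nvcr.io/nvidia/cuda:12.3.1-base-ubuntu22.04",
                command=["nvidia-smi"],
                resources=client.V1ResourceRequirements(
                    # With time-slicing this is one replica of a shared GPU;
                    # with MIG the resource name would instead be something
                    # like "nvidia.com/mig-1g.5gb".
                    limits={"nvidia.com/gpu": "1"},
                ),
            )
        ],
    ),
)
core.create_namespaced_pod(namespace="default", body=pod)
```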
SKS 1.2 also supports flexible and elastic management of GPU resources through standard Kubernetes management mechanisms:
- Supports automatic elastic scaling policies for GPU-attached nodes, so that the number of such nodes can be dynamically increased or decreased according to business needs.
- When a GPU-attached node fails, the system can automatically create a replacement node with the same GPU resources, ensuring continuous business operation and high resource availability.
These new features greatly improve the utilization, management efficiency, and flexibility of GPU resources, and help users cope more effectively with challenges such as fluctuating resource demand and node failures.
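For example, after an automatic node replacement, an operator could verify that the new node advertises the expected GPU capacity with a short script like the sketch below (the nvidia.com/ resource prefix is an assumption about the device plugin in use):
```python
# Minimal sketch: print the GPU resources each node advertises, e.g. to check
# that an automatically replaced node exposes the same GPU capacity.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

for node in core.list_node().items:
    allocatable = node.status.allocatable or {}
    gpus = {k: v for k, v in allocatable.items() if k.startswith("nvidia.com/")}
    if gpus:
        print(f"{node.metadata.name}: {gpus}")
```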
Based on the above features, SKS 1.2 can use both the CPU and GPU resources of the SMTX OS cluster to provide the computing power workload clusters need, handling various high-performance computing tasks such as AI. Here are some common scenarios:
- Machine Learning and Deep Learning Training: Supports deploying deep learning frameworks such as TensorFlow and PyTorch and ensures they have the necessary computing resources and GPU support. It also allows horizontal scaling to meet the processing needs of large-scale datasets and complex models (see the Job sketch after this list).
- Rendering and Graphics Processing: Supports GPU acceleration for graphics-intensive tasks such as rendering, for example speeding up rendering for film and game development.
- High-Performance Computing (HPC): Suitable for large-scale scientific and engineering computing, such as simulation, and supports horizontal scaling to improve performance.
- Data Analysis and Scientific Computing: Suitable for deploying data analysis and scientific computing applications such as Apache Spark and NumPy, and supports horizontal scaling to improve computing efficiency.
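As one illustration of the training scenario, the sketch below submits a GPU-backed batch Job with the Kubernetes Python client and scales out by raising the Job's parallelism field. The image name and training command are hypothetical placeholders, and real distributed training would typically be managed by a framework operator rather than a bare Job.
```python
# Minimal sketch: a GPU-backed training Job. Image and command are
# hypothetical placeholders; raise `parallelism` to scale out workers.
from kubernetes import client, config

config.load_kube_config()
batch = client.BatchV1Api()

job = client.V1Job(
    metadata=client.V1ObjectMeta(name="pytorch-train"),
    spec=client.V1JobSpec(
        parallelism=2,
        completions=2,
        backoff_limit=1,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[
                    client.V1Container(
                        name="trainer",
                        image="registry.example.com/ml/pytorch-train:latest",
                        command=["python", "train.py"],
                        resources=client.V1ResourceRequirements(
                            limits={"nvidia.com/gpu": "1"},
                        ),
                    )
                ],
            )
        ),
    ),
)
batch.create_namespaced_job(namespace="default", body=job)
```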
Please check out the detailed demo video here.
Support for Multiple CPU Architectures
SKS 1.2 supports creating and deploying Kubernetes workload clusters on SMTX OS clusters with the AArch64 CPU architecture to meet diverse customer needs. It also supports running the openEuler OS on Kubernetes nodes.
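In a mixed-architecture environment, workloads can be pinned to AArch64 nodes with the standard kubernetes.io/arch node label. Below is a minimal sketch of this generic Kubernetes pattern; the image is assumed to provide an arm64 build.
```python
# Minimal sketch: schedule a pod onto AArch64 nodes using the standard
# kubernetes.io/arch node label. The image must provide an arm64 build.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="arm64-demo"),
    spec=client.V1PodSpec(
        node_selector={"kubernetes.io/arch": "arm64"},
        containers=[client.V1Container(name="app", image="nginx:alpine")],
    ),
)
core.create_namespaced_pod(namespace="default", body=pod)
```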
Optimized Management & Use of Container Images
SKS 1.2 also enhances the management and use of container images and allows users to configure trusted container image registries. Users can quickly create virtual machine-based container image registries (Harbor) on SMTX OS clusters to provide container image services for SKS workload clusters. SKS 1.2 also lets users configure self-maintained, third-party container image registries for workload clusters.
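Inside the workload cluster, pulling images from a trusted private registry such as a Harbor instance relies on the standard Kubernetes image pull secret mechanism. The sketch below uses hypothetical registry addresses and credentials; in SKS, the registry itself is configured through the product, and this only illustrates the underlying Kubernetes pattern.
```python
# Minimal sketch: create an image pull secret for a private registry (e.g. a
# Harbor instance) and reference it from a pod. Registry address, credentials,
# and image are hypothetical placeholders.
import base64
import json

from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

auth = base64.b64encode(b"robot-sks:changeme").decode()
dockerconfig = {"auths": {"harbor.example.com": {"auth": auth}}}

secret = client.V1Secret(
    metadata=client.V1ObjectMeta(name="harbor-pull"),
    type="kubernetes.io/dockerconfigjson",
    data={
        ".dockerconfigjson": base64.b64encode(
            json.dumps(dockerconfig).encode()
        ).decode()
    },
)
core.create_namespaced_secret(namespace="default", body=secret)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="private-image-demo"),
    spec=client.V1PodSpec(
        image_pull_secrets=[client.V1LocalObjectReference(name="harbor-pull")],
        containers=[
            client.V1Container(
                name="app", image="harbor.example.com/library/app:1.0"
            )
        ],
    ),
)
core.create_namespaced_pod(namespace="default", body=pod)
```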
Continuous User Experience Optimization
To continuously optimize the user experience, SKS 1.2 also introduces several new features and improvements in management and operations:
- Displays the management cluster's control plane virtual IP and SKS container image registry information for quick access during operations and maintenance.
- By default, the Pod IP CIDR and Service IP CIDR values of the management cluster are used when creating a workload cluster to avoid conflicts between the system default values and the user’s network configuration.
- When upgrading a workload cluster, if the virtual machine template has not been distributed to the selected cluster, a text prompt guides the user to manually distribute the corresponding virtual machine template from the content library, preventing the upgrade task from timing out because the template is missing.
- Displays the management cluster's events to help users understand the cluster's status.
- Displays each node's CPU and memory allocation in the management cluster's node information.
More Information
To learn more about SKS 1.2, please visit the official website or download the product brief. If you have any questions, please join us on Slack.