Deploying Large 405B Models in Full Precision on Runpod
Sep 24
3 min read
In the rapidly evolving field of artificial intelligence, the ability to run large language models (LLMs) at full precision is becoming increasingly important. Today, we'll explore how to deploy massive models like Meta's Llama 3.1 405B in full precision using Runpod and AMD's MI300X GPUs.
The Power of Full Precision
Running large models in full precision is not required in every scenario, but it produces the most accurate answers possible and generally performs better at longer contexts. The trade-off is higher compute requirements and slower inference. Even if you ultimately serve your AI applications at a lower precision, full precision remains a useful baseline for benchmarking a model's capabilities.
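To make that benchmarking idea concrete, here is a minimal sketch using vLLM's offline API (the same engine we use for serving later in this post). The prompt and settings are illustrative only, and loading a 405B model this way requires the full 8-GPU MI300X setup described below:

```python
# Minimal benchmarking sketch with vLLM's offline API.
# Settings and prompt are illustrative; a 405B model needs the
# 8x MI300X configuration described later in this post.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3.1-405B-Instruct",
    dtype="float16",         # full (16-bit) precision weights
    tensor_parallel_size=8,  # shard the model across all 8 GPUs
    # For a lower-precision comparison, you could instead load a quantized
    # variant (e.g. via vLLM's `quantization` option, where supported).
)

params = SamplingParams(max_tokens=256, temperature=0.0)
start = time.time()
outputs = llm.generate(["Summarize the trade-offs of full-precision inference."], params)
elapsed = time.time() - start

print(f"Generated {len(outputs[0].outputs[0].token_ids)} tokens in {elapsed:.1f}s")
```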
Hardware Comparison: AMD MI300X vs NVIDIA H100
The hardware landscape for running massive LLMs is dominated by two major players: AMD's MI300X and NVIDIA's H100. While both are designed for high-performance computing, they have distinct characteristics that make them suitable for different use cases.
The AMD Instinct MI300X is based on the CDNA 3 architecture, optimized for AI and HPC workloads. It features a staggering 192GB of HBM3 memory, making it particularly well-suited for large language models and AI inference tasks. The MI300X is designed as a pure, high-performance GPU, using only CDNA 3 GPU tiles rather than a mix of CPU and GPU tiles found in its MI300A counterpart.
In contrast, the NVIDIA H100, built on the Hopper architecture, comes with 80GB of HBM memory (HBM3 on the SXM variant, HBM2e on the PCIe variant). While the H100 is known for its exceptional performance in both training and inference tasks, its lower memory capacity can be a limitation when working with the largest models.
The MI300X's massive memory capacity allows it to keep entire large models in memory during inference, potentially reducing memory-related bottlenecks and improving performance for certain AI workloads. This makes the MI300X particularly attractive for inference tasks with large models like Llama 405B.
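A quick back-of-the-envelope check makes this concrete. The figures below count only the 16-bit weights and ignore KV cache and activation overhead:

```python
# Back-of-the-envelope memory check for serving Llama 3.1 405B at 16-bit precision.
params_billion = 405
bytes_per_param = 2  # float16 / bfloat16

weights_gb = params_billion * 1e9 * bytes_per_param / 1e9  # ~810 GB of weights

mi300x_node = 8 * 192  # 1536 GB of HBM across 8x MI300X
h100_node = 8 * 80     # 640 GB of HBM across 8x H100 (80GB variant)

print(f"Weights alone: {weights_gb:.0f} GB")
print(f"8x MI300X:     {mi300x_node} GB -> fits: {weights_gb < mi300x_node}")
print(f"8x H100 80GB:  {h100_node} GB -> fits: {weights_gb < h100_node}")
```

The weights alone (~810GB) already exceed an 8x 80GB H100 node, which is why the MI300X's 192GB per GPU matters so much here.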
Deploying on Runpod with AMD MI300X
To deploy meta-llama/Meta-Llama-3.1-405B-Instruct on Runpod using AMD MI300X GPUs, follow these steps:
1. Select a container image that supports ROCm and vLLM. We recommend `rocm/vllm:rocm6.2_mi300_ubuntu22.04_py3.9_vllm_7c5fd50`, which can be found here: https://hub.docker.com/r/rocm/vllm/tags
2. Provision at least 1000GB (1TB) of storage and select 8x AMD MI300X GPUs for your pod.
3. Set up a network storage volume to persist the model between sessions and avoid long download times.
4. Launch the model using the following command:

python3 -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-405B-Instruct --dtype float16 --tensor-parallel-size 8 --worker-use-ray
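Once the model has finished loading, the server exposes an OpenAI-compatible API (on port 8000 by default). Here is a minimal client sketch using the `requests` library, assuming you are querying from the same pod or have exposed the port through Runpod:

```python
# Minimal sketch: query the vLLM OpenAI-compatible completions endpoint.
# Assumes the default vLLM port (8000) and local access from the same pod.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "meta-llama/Meta-Llama-3.1-405B-Instruct",
        "prompt": "Explain tensor parallelism in two sentences.",
        "max_tokens": 128,
        "temperature": 0.2,
    },
    timeout=300,
)
print(response.json()["choices"][0]["text"])
```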
Running a 405B-parameter model in full precision presents several challenges. The initial startup time can be considerable, often taking 30-60 minutes to load the model into GPU memory. Fortunately, Runpod lets you rent servers by the hour, which helps keep costs under control while you experiment.
ROCm vs CUDA: The Software Ecosystem
While NVIDIA's CUDA has long been the dominant platform for GPU computing in AI, AMD's ROCm (Radeon Open Compute) is gaining traction. ROCm is an open-source platform designed to support high-performance computing, machine learning, and AI workloads on AMD GPUs.
ROCm's adoption is growing, particularly in the HPC and AI communities. It supports popular frameworks like TensorFlow and PyTorch, although the level of optimization may not yet match that of CUDA. The open-source nature of ROCm allows for greater flexibility and transparency, which is attractive to many researchers and organizations.
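On a ROCm build of PyTorch, the familiar `torch.cuda` API is backed by HIP, so much existing code runs unchanged. A quick sketch to verify that the ROCm backend is active inside the container (assuming a ROCm build of PyTorch is installed, as in the image recommended above):

```python
# Quick check that PyTorch sees the AMD GPUs through ROCm.
# On ROCm builds, the torch.cuda namespace is backed by HIP.
import torch

print("GPU available:", torch.cuda.is_available())
print("HIP version:  ", torch.version.hip)  # None on CUDA builds
print("Device count: ", torch.cuda.device_count())
print("Device 0:     ", torch.cuda.get_device_name(0))
```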
However, it's important to note that ROCm support is still limited in some areas. For example, vLLM does not support parameters like `--pipeline-parallel-size` when using ROCm, which can limit inference speed optimization options compared to CUDA.
Conclusion
Deploying massive models like Llama 3.1 405B in full precision is a challenging but rewarding endeavor. AMD's MI300X GPUs, with their enormous memory capacity, are making it more accessible to run these models efficiently, particularly for inference tasks. While challenges remain, particularly in the software ecosystem with ROCm still catching up to CUDA in some areas, the landscape is rapidly evolving.