Compiling vLLM for Ampere Altra (Max)
In preparation for teaching a class on LLMOps and developing agentic AI systems, I’ve been experimenting with solutions for serving LLMs to multiple users efficiently. I previously wrote a blog post about using Ampere’s fork of llama.cpp for serving smaller LLMs without GPUs.
While Ollama and llama.cpp are great for a single user, they are not particularly well suited to serving multiple concurrent users. vLLM is another LLM serving engine, one that is quickly becoming the de facto open-source solution for production LLM serving. Its architecture is designed to maintain throughput across many simultaneous users while capping latency. vLLM has roots in the academic community, employing techniques like PagedAttention and smarter scheduling to achieve more efficient batching.
The Problem
That said, I ran into issues getting vLLM to run for CPU inference. I followed the user guide instructions to install pre-compiled Python wheels for Arm64. vLLM would start successfully and warm up the model, but as soon as I made a request, the vLLM engine would segfault and die.
(I had no problems with the standard instructions for GPU inference using the CUDA backend on the same system.)
The Solution
I ultimately realized that the precompiled version might be assuming that my processor supports the BF16 (bfloat16) instructions. While the AmpereOne processors do, my Ampere Altra (Max) CPU doesn’t. I was able to get vLLM working by compiling it from source for my particular system. Thankfully, the instructions for building in Docker make this pretty easy.
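Before building, you can check whether your own CPU advertises the BF16 extension. On Arm Linux systems, supported features appear as flags in /proc/cpuinfo (a quick sketch; on an Ampere Altra you should see no match):

```shell
# Check whether the CPU advertises the Arm BF16 extension.
# Linux lists "bf16" under Features in /proc/cpuinfo on CPUs that
# support it (AmpereOne does; Ampere Altra (Max) does not).
if grep -qw bf16 /proc/cpuinfo; then
  BF16=yes
else
  BF16=no
fi
echo "bf16 support: $BF16"
```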
# download the source for a release
$ wget https://github.com/vllm-project/vllm/releases/download/v0.17.1/vllm-0.17.1.tar.gz
# extract it
$ tar -xf vllm-0.17.1.tar.gz
# change directories
$ cd vllm-0.17.1
# build
$ docker build -f docker/Dockerfile.cpu \
--tag vllm-cpu-arm64 \
--target vllm-openai .
# run; replace <hf-token> with your HuggingFace token
$ docker run -d \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 8000:8000 \
-e "OMP_NUM_THREADS=32" \
-e "VLLM_CPU_OMP_THREADS_BIND=nobind" \
-e "HF_TOKEN=<hf-token>" \
--name vllm \
vllm-cpu-arm64:latest \
--max-model-len 16384 \
--max-num-batched-tokens 2048 \
meta-llama/Llama-3.2-3B-Instruct
# tail logs
$ docker logs vllm -f
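Once the container is up, you can smoke-test it through vLLM’s OpenAI-compatible API (a minimal sketch; it assumes the server from above is reachable on localhost:8000 and serving the same model):

```shell
# Send one chat completion request to the running vLLM server.
PAYLOAD='{
  "model": "meta-llama/Llama-3.2-3B-Instruct",
  "messages": [{"role": "user", "content": "Say hello."}],
  "max_tokens": 32
}'
# The || branch keeps the check non-fatal if the server is still warming up.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable"
```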
A few notes:
- Use `OMP_NUM_THREADS` to determine the number of cores to use
- Use `VLLM_CPU_OMP_THREADS_BIND` to pin threads to specific cores for better performance
- Use `--max-model-len` to set the maximum context size for a single request
- Use `--max-num-batched-tokens` to set the number of tokens per batch (default: 512); higher values increase throughput at the cost of latency
(I found Red Hat’s summary of key server arguments as well as the vLLM config documentation useful.)
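As a starting point for the thread settings, the online core count is easy to query (a hypothetical heuristic: use every core nproc reports, then tune down if the machine is shared):

```shell
# Suggest an initial OMP_NUM_THREADS value from the online core count.
CORES=$(nproc)
echo "suggested OMP_NUM_THREADS: $CORES"
```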
Conclusions
I haven’t conclusively benchmarked vLLM against llama.cpp. So far, however, I can say that:
- vLLM using a single NVIDIA GeForce RTX 5060 Ti (16 GB VRAM) scales really well to 32 concurrent connections without a significant drop in tokens per second or increase in overall request completion time compared to a single user.
- vLLM is NOT optimized for CPU. It is very, very slow.
- llama.cpp doesn’t scale well beyond a handful of concurrent connections.
I still need to evaluate llama.cpp performance using my GPU. And I’m looking forward to evaluating vLLM performance on AmpereOne M. :)