Skip to content

vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs

vLLM is an open-source LLM inference and serving engine built for high throughput and memory efficiency. It uses PagedAttention to maximize GPU utilization, supports OpenAI-compatible APIs, and can serve Llama, DeepSeek, Qwen, Mistral, and hundreds of other models at scale.

80,574 stars17,011 forksPythonUpdated May 2026
✅ Reviewed by My AI Guide, vetted for vibe builders

Our Review

In 2023, a team at UC Berkeley's Sky Computing Lab identified GPU memory fragmentation -- not raw compute speed -- as the primary throttle on LLM serving throughput. Standard serving pre-allocates contiguous memory for each request's KV cache, leaving large chunks stranded when output lengths vary. Their fix, PagedAttention, manages the KV cache in non-contiguous pages the way an operating system manages virtual memory, nearly eliminating waste. Open-sourced as vLLM, it produced 2-24x higher throughput on the same hardware and became the de facto standard that production LLM serving stacks are measured against in 2026.

Key capabilities

  • PagedAttention: near-zero KV cache memory waste allows batching more requests simultaneously, directly increasing tokens-per-second throughput
  • OpenAI-compatible API server: drop-in replacement for the OpenAI REST API -- any app using the OpenAI SDK works with vLLM by changing the base URL
  • Continuous batching: requests are batched dynamically rather than padded to fixed lengths, maximizing GPU utilization under variable load
  • Speculative decoding: a small draft model generates token candidates that the large model verifies in parallel, reducing latency for greedy decoding
  • Multi-GPU and multi-node tensor parallelism: shard large models across 2, 4, 8, or more GPUs with built-in tensor and pipeline parallel support
  • Wide model support: Llama 3, DeepSeek-V3, Qwen3, Mistral, Gemma, Phi-4, Kimi, and 100+ others run natively without custom adapters

Getting started

Install with pip install vllm on a CUDA-capable machine. Start an OpenAI-compatible server with vllm serve meta-llama/Llama-3.1-8B-Instruct. Point your existing OpenAI SDK calls at http://localhost:8000/v1 and serve any supported model locally or on your own GPU cloud.

Limitation

vLLM requires NVIDIA CUDA (ROCm support for AMD exists but is less mature), and meaningful throughput gains need at least a 24GB GPU. CPU inference is not a primary use case -- for that, llama.cpp is the better choice. vLLM also requires more DevOps experience than managed inference APIs.

Our Verdict

vLLM is the correct choice for teams that need to serve LLMs at scale on their own hardware in 2026. The PagedAttention architecture is a genuine algorithmic improvement over naive serving -- not a configuration trick -- and the throughput gains translate directly to cost savings at scale. An 80,000+ star count and adoption by major AI labs (Mistral, Qwen, DeepSeek all test against it) signal production credibility.

The OpenAI-compatible API server is the practical reason most teams choose vLLM. You can migrate existing applications from the OpenAI API to self-hosted inference by changing one URL and one API key. There is no refactoring required, no SDK change, no application code to touch.

The limitation is the hardware bar. vLLM is built for GPU servers, not developer laptops. Teams evaluating local inference on consumer hardware should look at llama.cpp or Ollama instead. vLLM's value proposition kicks in at the point where you're running multiple concurrent users or need to control per-token cost at meaningful throughput.

Frequently Asked Questions

What is PagedAttention and why does it matter for LLM serving?

PagedAttention is vLLM's key memory management innovation. Standard LLM serving pre-allocates contiguous GPU memory for each request's KV cache, wasting memory for variable-length outputs. PagedAttention manages KV cache in non-contiguous pages (like OS virtual memory), eliminating fragmentation and enabling far more simultaneous requests on the same GPU hardware in 2026.

How does vLLM compare to llama.cpp for serving LLMs?

vLLM and llama.cpp serve different use cases. vLLM is optimized for high-throughput serving under concurrent load on NVIDIA GPUs -- it excels when many users query the same model simultaneously. llama.cpp is optimized for single-user local inference on CPU or consumer GPUs. For production APIs with hundreds of requests per minute, vLLM wins. For a local chatbot, llama.cpp is simpler and more portable.

Is vLLM compatible with the OpenAI API?

Yes. vLLM includes a built-in OpenAI-compatible REST API server. Run vllm serve with any supported model, and existing code using the openai Python SDK works by changing the base_url parameter to your vLLM server address. Chat completions, streaming, function calling, and embeddings are all supported endpoints.

What models can vLLM serve?

vLLM supports over 100 model architectures including Llama 3, DeepSeek-V3, Qwen 3, Mistral, Gemma 2, Phi-4, Kimi K2, and most major open-weight LLMs from Hugging Face. Models load directly from Hugging Face Hub or local directories. Quantized models (GPTQ, AWQ, FP8) are also supported for reducing memory requirements.

What hardware does vLLM require?

vLLM primarily targets NVIDIA CUDA GPUs. For production serving, 24GB VRAM (A10, 3090, 4090) is the practical minimum for 7B models; larger models need multiple GPUs via tensor parallelism. AMD ROCm support exists in 2026 but has fewer optimizations. CPU inference is technically possible but not the intended use case -- llama.cpp is better suited for CPU workloads.

What is vllm?

vLLM is an open-source LLM inference and serving engine built for high throughput and memory efficiency. It uses PagedAttention to maximize GPU utilization, supports OpenAI-compatible APIs, and can serve Llama, DeepSeek, Qwen, Mistral, and hundreds of other models at scale.

How do I install vllm?

Visit the GitHub repository at https://github.com/vllm-project/vllm for installation instructions.

What license does vllm use?

vllm uses the Apache-2.0 license.

What are alternatives to vllm?

Explore related tools and alternatives on My AI Guide.

🔒

Open source & community-verified

Apache-2.0 licensed: free to use in any project, no strings attached. 80,574 developers have starred this, meaning the community has reviewed and trusted it.

Reviewed by My AI Guide for relevance, quality, and active maintenance before listing.

Topics

llminferencellm-servingopenaicudapytorchllamadeepseekqwen

Related Tools

View all