Skip to content

ggml-org/llama.cpp

LLM inference in C/C++

llama.cpp makes it possible to run large language models like Llama, Mistral, Phi, and Qwen entirely on your own hardware -- no API key, no cloud dependency. Written in pure C/C++ with MIT license, it runs on Mac, Linux, and Windows with CPU, GPU, and Apple Silicon support.

111,793 stars18,499 forksC++Updated May 2026
✅ Reviewed by My AI Guide, vetted for vibe builders

Our Review

Georgi Gerganov published llama.cpp in March 2023, three months after ChatGPT launched, with a goal most considered impossible: running a 7B-parameter language model on a MacBook M1 at interactive speed. The project worked, shipped, and within months became the backbone of Ollama, LM Studio, Jan, and dozens of other local AI applications -- all of which wrap llama.cpp at their core.

Key capabilities

  • GGUF model format: every quantized model on Hugging Face runs directly without Python, Torch, or any ML framework installed
  • Metal + CUDA + Vulkan backends: offloads computation to Apple Metal, NVIDIA CUDA, or AMD/Intel Vulkan depending on hardware
  • Aggressive quantization: a 7B-parameter model compresses to run on 4GB RAM; a 13B model fits on an M1 MacBook Pro
  • Built-in HTTP server: llama-server exposes an OpenAI-compatible REST API so existing apps need only a URL swap to use local models
  • Multi-modal support: recent builds handle vision models (LLaVA, Moondream) and structured JSON output via grammar-based sampling

Limitation

Setup requires compiling from source with C++ tools. You choose quantization levels manually and manage VRAM budgets yourself. Non-technical users are better served by Ollama or LM Studio, which both wrap llama.cpp in a friendlier shell.

Our Verdict

llama.cpp is the bedrock of the local LLM movement. If you run AI locally -- whether through Ollama, LM Studio, or directly -- you're almost certainly running llama.cpp underneath. The MIT license, zero dependencies outside the C++ standard library, and relentless performance optimization make it the right default for anyone building local inference infrastructure.

For developers building applications that need local inference -- privacy-first tools, offline-capable products, or simply eliminating API costs at scale -- llama.cpp is the answer. The OpenAI-compatible server mode means integration is straightforward: swap your endpoint URL and keep your existing code.

The main trade-off is accessibility. llama.cpp rewards developers who understand how to compile software, interpret quantization trade-offs, and tune runtime flags. Users who want a GUI or a one-click install should reach for Ollama or LM Studio instead, both of which wrap llama.cpp in a friendlier shell.

Frequently Asked Questions

What models can I run with llama.cpp?

Any model available in GGUF format -- which covers Llama 3, Mistral, Phi-4, Qwen 2.5, DeepSeek, Gemma 2, and hundreds more. Hugging Face hosts thousands of pre-quantized GGUF files. Download the file, point llama.cpp at it, and inference starts immediately with no additional setup required.

Do I need a GPU to use llama.cpp?

No. llama.cpp runs entirely on CPU and is specifically optimized for it. A GPU accelerates inference significantly -- especially for larger models -- but a modern CPU handles 7B-parameter models at comfortable speeds. Apple Silicon (M1-M4) benefits from unified memory and Metal GPU offloading automatically.

What is GGUF and how does it differ from other model formats?

GGUF is llama.cpp's native model format. It bundles weights, vocabulary, and metadata into a single file with quantization built in. Unlike PyTorch or SafeTensors formats, GGUF models run without Python, Torch, or CUDA installed. Multiple quantization levels (Q4, Q5, Q8) let you trade quality for memory footprint.

How does llama.cpp compare to Ollama or LM Studio?

Ollama and LM Studio both use llama.cpp as their inference backend. They add user-friendly interfaces -- Ollama with a CLI and API, LM Studio with a desktop GUI. If you want maximum control with minimal overhead, use llama.cpp directly. If you want simpler setup, those tools are the right choice.

Is llama.cpp suitable for production inference serving?

For single-user or small-team workloads, yes -- the built-in llama-server is production-quality with batching support. For high-throughput serving at scale, vLLM or dedicated inference services are better suited. llama.cpp excels at edge, embedded, and local deployment where minimal dependencies matter most in 2026.

What is llama.cpp?

llama.cpp makes it possible to run large language models like Llama, Mistral, Phi, and Qwen entirely on your own hardware -- no API key, no cloud dependency. Written in pure C/C++ with MIT license, it runs on Mac, Linux, and Windows with CPU, GPU, and Apple Silicon support.

How do I install llama.cpp?

Visit the GitHub repository at https://github.com/ggml-org/llama.cpp for installation instructions.

What license does llama.cpp use?

llama.cpp uses the MIT license.

What are alternatives to llama.cpp?

Explore related tools and alternatives on My AI Guide.

🔒

Open source & community-verified

MIT licensed: free to use in any project, no strings attached. 111,793 developers have starred this, meaning the community has reviewed and trusted it.

Reviewed by My AI Guide for relevance, quality, and active maintenance before listing.

Topics

ggml

Related Tools

View all