Key Takeaways
- llama.cpp vs Ollama: Ollama wraps llama.cpp for simplicity. Using llama.cpp directly gives you speculative-decoding control, custom quantization, per-layer GPU offloading, batch-size tuning, and every other inference parameter. Use Ollama for 90% of use cases; use llama.cpp when you need parameter-level control.
- 73,000+ GitHub stars. llama.cpp is the infrastructure layer for the entire local AI ecosystem — Ollama, LM Studio, GPT4All, and Jan.ai all use it under the hood.
- One codebase, four backends: CPU (AVX2/AVX512), NVIDIA (CUDA), AMD (ROCm/HIP), Apple Silicon (Metal). Compile once per backend. The same GGUF model files work across all backends.
- Companion to our other guides: This article completes the trilogy with GGUF Quantization Explained and Speculative Decoding: 2x Faster Local LLMs.
Introduction: Why Use llama.cpp Directly?
Direct Answer: How do I install and use llama.cpp to run GGUF models locally in 2026?
To install llama.cpp on Ubuntu 24.04, clone the repository with git clone --depth 1 https://github.com/ggml-org/llama.cpp, then compile with CMake: cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc). For CPU-only: omit -DGGML_CUDA=ON. For macOS Metal: use -DGGML_METAL=ON. Download a GGUF model from HuggingFace with huggingface-cli download bartowski/Llama-4-Scout-17B-Instruct-GGUF Llama-4-Scout-17B-Instruct-Q4_K_M.gguf --local-dir ~/models/. Run inference with ./build/bin/llama-cli --model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf --prompt "Hello" --n-gpu-layers 99 --ctx-size 8192. The --n-gpu-layers 99 flag offloads all layers to GPU — reduce this number if you get CUDA out-of-memory errors. The API server starts with ./build/bin/llama-server --model model.gguf --port 8080.
“llama.cpp is what made local AI actually local. Every model you run on your hardware, every token that never leaves your machine — that’s llama.cpp doing the math.”
Part 1: Install Build Dependencies
# Ubuntu 24.04 — install build tools
sudo apt-get update
sudo apt-get install -y \
build-essential \
cmake \
git \
libcurl4-openssl-dev \
python3-pip
# Verify cmake version (3.21+ required)
cmake --version
Expected output:
cmake version 3.28.3
For NVIDIA GPU support — install CUDA Toolkit:
# Check if CUDA is already installed
nvcc --version 2>/dev/null || echo "CUDA not installed"
# Install CUDA 12.7 (if not present)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-7
# Add CUDA to PATH
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version
Expected output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Mon_Oct_28_18:21:19_PDT_2024
Cuda compilation tools, release 12.7, V12.7.66
Part 2: Clone and Compile llama.cpp
# Clone (depth 1 = latest commit only, faster download)
git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
Compile for NVIDIA GPU (CUDA)
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_CURL=ON # Enable model download from URLs
cmake --build build --config Release -j$(nproc)
Expected output (final lines):
[100%] Linking CXX executable llama-cli
[100%] Built target llama-cli
[100%] Linking CXX executable llama-server
[100%] Built target llama-server
Compilation takes 3–8 minutes depending on hardware.
Compile for Apple Silicon (Metal)
cmake -B build \
-DGGML_METAL=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
Compile for CPU only (all platforms)
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
Verify the build:
ls build/bin/ | grep -E "llama-cli|llama-server|llama-quantize|llama-speculative"
Expected output:
llama-cli
llama-quantize
llama-server
llama-speculative
# Check GPU support was compiled in
./build/bin/llama-cli --version 2>&1 | head -3
Expected output (CUDA build):
version: 3650 (b4800)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
CUDA: enabled
Part 3: Download GGUF Models
# Install HuggingFace CLI
pip install huggingface-hub --break-system-packages
mkdir -p ~/models
# Download Llama 4 Scout Q4_K_M (10GB — recommended for 10GB+ VRAM)
huggingface-cli download \
bartowski/Llama-4-Scout-17B-Instruct-GGUF \
Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--local-dir ~/models/
# Download Qwen3 8B Q4_K_M (5.2GB — fits on any 8GB+ GPU)
huggingface-cli download \
bartowski/Qwen3-8B-GGUF \
Qwen3-8B-Q4_K_M.gguf \
--local-dir ~/models/
# Download nomic-embed-text for embeddings
huggingface-cli download \
nomic-ai/nomic-embed-text-v1.5-GGUF \
nomic-embed-text-v1.5.Q4_K_M.gguf \
--local-dir ~/models/
# Verify downloads
ls -lh ~/models/
Expected output:
total 15G
-rw-r--r-- 1 youruser youruser 10G Apr 17 15:10 Llama-4-Scout-17B-Instruct-Q4_K_M.gguf
-rw-r--r-- 1 youruser youruser 5.2G Apr 17 15:15 Qwen3-8B-Q4_K_M.gguf
-rw-r--r-- 1 youruser youruser 274M Apr 17 15:16 nomic-embed-text-v1.5.Q4_K_M.gguf
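If you want to check what a multi-gigabyte download actually contains before loading it, the GGUF header is readable from Python. A minimal sketch, assuming the gguf package published alongside llama.cpp (pip install gguf) and the paths used above:
# inspect_gguf.py: print basic metadata from a GGUF file (assumes: pip install gguf)
from pathlib import Path
from gguf import GGUFReader

path = Path.home() / "models" / "Qwen3-8B-Q4_K_M.gguf"
reader = GGUFReader(str(path))

print(f"file:    {path.name} ({path.stat().st_size / 1e9:.1f} GB)")
print(f"tensors: {len(reader.tensors)}")
for key in list(reader.fields)[:8]:   # first few metadata keys (architecture, context length, ...)
    print(f"  {key}")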
Part 4: Run Your First Inference
cd ~/llama.cpp
# Basic inference — one-shot prompt
./build/bin/llama-cli \
--model ~/models/Qwen3-8B-Q4_K_M.gguf \
--prompt "What is the capital of France? Answer in one word." \
--n-predict 10 \
--n-gpu-layers 99 \
--no-display-prompt \
-e 2>/dev/null
Expected output:
Paris
# Full inference with timing statistics
./build/bin/llama-cli \
--model ~/models/Qwen3-8B-Q4_K_M.gguf \
--prompt "Explain quantum computing in 3 sentences." \
--n-predict 200 \
--n-gpu-layers 99 \
--ctx-size 4096 \
--flash-attn \
2>&1 | tail -8
Expected output (timing block):
llama_print_timings: load time = 412.33 ms
llama_print_timings: sample time = 5.22 ms / 200 runs
llama_print_timings: prompt eval time = 98.45 ms / 10 tokens ( 9.85 ms/token, 101.58 tokens/s)
llama_print_timings: eval time = 4312.67 ms / 199 runs ( 21.67 ms/token, 46.15 tokens/s)
llama_print_timings: total time = 4416.34 ms / 209 tokens
46.15 tokens/second on Qwen3 8B Q4_K_M with full GPU offload.
Part 5: Key Inference Flags
# The most important flags — memorise these. Bash does not allow comments after
# a line-continuation backslash, so they are listed here as an annotated
# reference; a clean, pasteable command follows at the end.

# ── GPU offloading ──────────────────────────────────────────────────────
--n-gpu-layers 99       # Offload all layers to GPU (use < 99 if OOM)

# ── Context and generation ──────────────────────────────────────────────
--ctx-size 32768        # Context window (token count)
--n-predict 500         # Max tokens to generate
--batch-size 512        # Prompt evaluation batch size (higher = faster)

# ── Memory optimisation ─────────────────────────────────────────────────
--flash-attn            # Flash Attention (reduces VRAM ~30% on long contexts)
--cache-type-k q8_0     # KV key cache quantization (saves VRAM)
--cache-type-v q8_0     # KV value cache quantization

# ── Sampling parameters ─────────────────────────────────────────────────
--temp 0.7              # Temperature (0=deterministic, 1=creative)
--top-p 0.9             # Nucleus sampling threshold
--repeat-penalty 1.1    # Reduce repetition

# ── Output control ──────────────────────────────────────────────────────
--no-display-prompt     # Don't echo the input prompt in output
-e                      # Process escape sequences (\n, \t etc.)

# Pasteable equivalent:
./build/bin/llama-cli \
--model MODEL.gguf \
--prompt "Your prompt here" \
--n-gpu-layers 99 --ctx-size 32768 --n-predict 500 --batch-size 512 \
--flash-attn --cache-type-k q8_0 --cache-type-v q8_0 \
--temp 0.7 --top-p 0.9 --repeat-penalty 1.1 \
--no-display-prompt -e
Test different context sizes and observe VRAM usage:
for ctx in 4096 8192 16384 32768; do
echo -n "ctx_size=$ctx: "
./build/bin/llama-cli \
--model ~/models/Qwen3-8B-Q4_K_M.gguf \
--prompt "test" --n-predict 1 \
--n-gpu-layers 99 --ctx-size $ctx --flash-attn \
2>&1 | grep "KV self size\|n_ctx" | head -1
done
Expected output:
ctx_size=4096: llama_new_context_with_model: KV self size = 512.00 MiB
ctx_size=8192: llama_new_context_with_model: KV self size = 1024.00 MiB
ctx_size=16384: llama_new_context_with_model: KV self size = 2048.00 MiB
ctx_size=32768: llama_new_context_with_model: KV self size = 4096.00 MiB
KV cache size doubles with every doubling of context window — this is why context length is the dominant factor in VRAM usage for long conversations.
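The scaling above follows directly from the KV-cache formula: bytes ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × n_ctx × bytes_per_element. Here is a rough estimate in Python; the layer count, KV-head count and head size are illustrative placeholders rather than Qwen3 8B's exact configuration (read the real values from the model-load log):
# kv_cache_estimate.py: rough KV cache size per context length
n_layers, n_kv_heads, head_dim = 32, 8, 128   # illustrative values, not exact model config
bytes_per_elem = 2                            # f16 cache; q8_0 is ~1 byte, q4_0 ~0.5

for n_ctx in (4096, 8192, 16384, 32768):
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    print(f"ctx={n_ctx:6d}  ->  {kv_bytes / 2**20:7.0f} MiB")
# Linear in n_ctx: doubling the context window doubles the cache, as measured above.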
Part 6: llama-server — OpenAI-Compatible API
# Start the API server
./build/bin/llama-server \
--model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99 \
--ctx-size 32768 \
--flash-attn \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--port 8080 \
--host 127.0.0.1 \
--parallel 2 \
--cont-batching          # 2 concurrent request slots, continuous batching for throughput
Expected output:
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: GPU_0 model buffer size = 9847.31 MiB
llama_new_context_with_model: n_ctx = 32768
llama server listening at http://127.0.0.1:8080
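Before pointing clients at the server, you can wait for it to finish loading the model. A minimal sketch using only the Python standard library; it polls the /health endpoint that recent llama-server builds expose (returns 200 once the model is ready):
# wait_for_server.py: block until llama-server reports healthy
import time
import urllib.request

URL = "http://127.0.0.1:8080/health"

for _ in range(30):
    try:
        with urllib.request.urlopen(URL, timeout=2) as resp:
            if resp.status == 200:
                print("llama-server is up")
                break
    except OSError:
        pass          # not listening yet, or still loading the model
    time.sleep(1)
else:
    raise SystemExit("llama-server did not become healthy within 30 seconds")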
Test the OpenAI-compatible chat completions endpoint:
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama4-scout",
"messages": [{"role": "user", "content": "Write a haiku about local AI."}],
"max_tokens": 50,
"temperature": 0.7
}' | python3 -m json.tool
Expected output:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "llama4-scout",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Weights on local disk,\nNo cloud hears my private thoughts—\nTokens stay with me."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 15,
"completion_tokens": 22,
"total_tokens": 37
}
}
Use with the Python OpenAI SDK — zero code changes:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed" # Required by SDK, ignored by llama-server
)
response = client.chat.completions.create(
model="any-name", # llama-server ignores the model name
messages=[{"role": "user", "content": "What is 2 + 2?"}],
max_tokens=20
)
print(response.choices[0].message.content)
Expected output:
2 + 2 = 4
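Streaming works exactly as it does against the cloud API: pass stream=True and iterate over the chunks. A short sketch reusing the client from above (the prompt is just an example):
# Stream tokens as they are generated, reusing the client defined above
stream = client.chat.completions.create(
    model="any-name",
    messages=[{"role": "user", "content": "List three uses for a local LLM."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:                      # the final chunk carries no content
        print(delta, end="", flush=True)
print()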
Part 7: Custom Quantization
# Convert a HuggingFace model to GGUF format
# (useful when a model is only available as safetensors)
# Step 1: Download the full-precision model
# (meta-llama repos are gated on HuggingFace; accept the licence and run huggingface-cli login first)
huggingface-cli download \
meta-llama/Llama-3.2-1B-Instruct \
--local-dir ~/models/llama32-1b-hf/
# Step 2: Convert to GGUF F16
# (the converter's Python dependencies are in the repo: pip install -r requirements.txt)
python3 convert_hf_to_gguf.py \
~/models/llama32-1b-hf/ \
--outfile ~/models/Llama-3.2-1B-F16.gguf \
--outtype f16
Expected output:
INFO:hf-to-gguf:Model successfully exported to ~/models/Llama-3.2-1B-F16.gguf
# Step 3: Quantize to your target format
# Available: q2_k q3_k_s q3_k_m q3_k_l q4_0 q4_k_s q4_k_m q5_k_s q5_k_m q6_k q8_0
./build/bin/llama-quantize \
~/models/Llama-3.2-1B-F16.gguf \
~/models/Llama-3.2-1B-Q4_K_M.gguf \
q4_k_m
Expected output:
main: quantize time = 8234.00 ms
main: total time = 8234.00 ms
# Verify size reduction
ls -lh ~/models/Llama-3.2-1B-*.gguf
Expected output:
-rw-r--r-- 1 youruser youruser 2.4G Apr 17 15:35 Llama-3.2-1B-F16.gguf
-rw-r--r-- 1 youruser youruser 668M Apr 17 15:36 Llama-3.2-1B-Q4_K_M.gguf
2.4GB → 668MB — 72% size reduction.
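The file sizes line up with a simple bits-per-weight estimate. A back-of-envelope sketch; the parameter count and the effective bits-per-weight figures are approximations, not numbers reported by llama-quantize:
# Rough GGUF size estimate: parameters × effective bits-per-weight / 8
params = 1.24e9                                   # Llama 3.2 1B, approximate
for name, bpw in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.5)]:
    size_gb = params * bpw / 8 / 1e9
    print(f"{name:7s} ~{size_gb:.2f} GB")
# F16 lands near 2.5 GB and Q4_K_M near 0.7 GB, in line with the ls output above.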
Part 8: Speculative Decoding with llama-speculative
For the full explanation, see Speculative Decoding: 2x Faster Local LLMs. Here’s the quick setup:
# Run speculative decoding (draft model + target model)
./build/bin/llama-speculative \
--model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/Llama-3.2-1B-Q4_K_M.gguf \
--n-gpu-layers 99 \
--n-gpu-layers-draft 99 \
--draft 8 \
--ctx-size 8192 \
--flash-attn \
--prompt "Write a Python function for bubble sort:"
Expected output (timing block):
-- speculative decoding stats --
accepted tokens: 874 / 1120 drafts (78.0% acceptance rate)
tokens per second: 43.8 tok/s ← vs 22 tok/s standard for this model
78% acceptance rate on code → 2x speedup.
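The arithmetic behind that claim is worth seeing once. A back-of-envelope sketch; the draft-to-target cost ratio is an assumption, not something llama-speculative reports:
# Why ~78% acceptance with --draft 8 roughly doubles throughput
draft_len        = 8        # tokens drafted per cycle (--draft 8)
acceptance       = 0.78     # accepted / drafted, from the stats block above
draft_cost_ratio = 0.10     # assumed: one 1B draft pass costs ~10% of a 17B target pass

tokens_per_cycle = acceptance * draft_len + 1        # accepted drafts + 1 token from the verify pass
cost_per_cycle   = draft_len * draft_cost_ratio + 1  # 8 draft passes + 1 target pass
print(f"ideal speedup ~{tokens_per_cycle / cost_per_cycle:.1f}x")
# Prints ~4.0x for the ideal case; sampling overhead and work wasted on rejected
# drafts bring the measured gain closer to the ~2x shown above.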
Part 9: Benchmarking Your Hardware
# llama-bench runs standardised benchmarks
./build/bin/llama-bench \
--model ~/models/Qwen3-8B-Q4_K_M.gguf \
--n-gpu-layers 99 \
--flash-attn 1 \
--output md
Expected output (RTX 4090):
| model | size | params | backend | ngl | test | t/s |
| ------------------- | ------ | -------- | ---------- | --- | -------- | ----------- |
| qwen3 8B Q4_K_M | 5.17 G | 8.19 B | CUDA | 99 | pp 512 | 2847.32 ± 12.4 |
| qwen3 8B Q4_K_M | 5.17 G | 8.19 B | CUDA | 99 | tg 128 | 82.14 ± 0.8 |
- pp = prompt processing (prefill) speed — how fast the model processes your input
- tg = token generation speed — how fast it generates output (what you feel as response latency)
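Those two numbers combine into the latency you actually experience. A quick estimate; the prompt and output lengths are made-up example values:
# End-to-end latency ≈ prompt_tokens / pp_speed + output_tokens / tg_speed
pp_speed, tg_speed = 2847.0, 82.1            # tokens/s from the bench table above
prompt_tokens, output_tokens = 2000, 300     # example request

ttft  = prompt_tokens / pp_speed             # time to first token (prefill)
total = ttft + output_tokens / tg_speed
print(f"time to first token ~{ttft:.2f} s, total ~{total:.1f} s")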
Part 10: The Sovereignty Layer — Network Audit
echo "=== SOVEREIGN llama.cpp AUDIT ==="
echo ""
echo "[ Model files on local disk ]"
ls -lh ~/models/*.gguf 2>/dev/null | awk '{printf " ✓ %-45s %s\n", $9, $5}'
echo ""
echo "[ GPU layers fully offloaded ]"
./build/bin/llama-cli \
--model ~/models/Qwen3-8B-Q4_K_M.gguf \
--prompt "test" --n-predict 1 --n-gpu-layers 99 2>&1 | \
grep "offloading" | awk '{print " ✓ " $0}'
echo ""
echo "[ Outbound network connections during inference ]"
./build/bin/llama-cli \
--model ~/models/Qwen3-8B-Q4_K_M.gguf \
--prompt "count to 5" --n-predict 50 \
--n-gpu-layers 99 2>/dev/null &
LPID=$!
sleep 3
ss -tnp state established 2>/dev/null | grep llama || \
echo " ✓ No external connections — inference is fully local"
wait $LPID 2>/dev/null
echo ""
echo "[ API server outbound connections ]"
./build/bin/llama-server \
--model ~/models/Qwen3-8B-Q4_K_M.gguf \
--n-gpu-layers 99 --port 8080 --host 127.0.0.1 2>/dev/null &
SPID=$!
sleep 4
ss -tnp state established 2>/dev/null | grep llama | \
grep -v "127.0.0.1\|::1" || echo " ✓ Server makes no external connections"
kill $SPID 2>/dev/null
Expected output:
=== SOVEREIGN llama.cpp AUDIT ===
[ Model files on local disk ]
✓ /home/youruser/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf 10G
✓ /home/youruser/models/Qwen3-8B-Q4_K_M.gguf 5.2G
✓ /home/youruser/models/nomic-embed-text-v1.5.Q4_K_M.gguf 274M
[ GPU layers fully offloaded ]
✓ llm_load_tensors: offloading 36 repeating layers to GPU
✓ llm_load_tensors: GPU_0 model buffer size = 4873.21 MiB
[ Outbound network connections during inference ]
✓ No external connections — inference is fully local
[ API server outbound connections ]
✓ Server makes no external connections
SovereignScore: 98/100 — 2 points deducted for the one-time model downloads from HuggingFace.
Complete Flag Reference
# ── Model loading ──────────────────────────────────────────────────────────
--model FILE # GGUF model file path
--n-gpu-layers N # Layers to GPU (99 = all, 0 = CPU-only)
--main-gpu N # GPU index for main computation (multi-GPU)
--tensor-split 0.5,0.5 # Split across 2 GPUs evenly
# ── Context and generation ─────────────────────────────────────────────────
--ctx-size N # Context window (max tokens in conversation)
--n-predict N # Max tokens to generate (-1 = unlimited)
--batch-size N # Prompt batch size (512-2048, higher=faster prefill)
--ubatch-size N # Micro-batch size for pipeline
# ── Memory optimization ────────────────────────────────────────────────────
--flash-attn # Flash Attention (saves VRAM on long contexts)
--cache-type-k TYPE # KV key cache type: f32 f16 q8_0 q4_0
--cache-type-v TYPE # KV value cache type: f32 f16 q8_0 q4_0
--mlock # Lock model in RAM (prevents swapping)
--no-mmap # Load model into RAM (vs memory-mapped)
# ── Sampling ──────────────────────────────────────────────────────────────
--temp F # Temperature (0.0-1.0, 0=greedy/deterministic)
--top-p F # Top-p nucleus sampling (0.0-1.0)
--top-k N # Top-k sampling (0=disabled)
--min-p F # Min-p sampling (alternative to top-p)
--repeat-penalty F # Penalty for repeated tokens (1.0=none, 1.1=mild)
--seed N # Random seed for reproducibility (-1=random)
# ── Speculative decoding ───────────────────────────────────────────────────
--model-draft FILE # Draft model for speculative decoding
--n-gpu-layers-draft N # GPU layers for draft model
--draft N # Tokens to draft per cycle (4-12)
# ── Output ────────────────────────────────────────────────────────────────
--no-display-prompt # Don't echo input in output
-e # Process escape sequences (\n \t)
--log-disable # Disable log output (for piping)
-v / --verbose # Verbose output with timing
# ── Server specific ───────────────────────────────────────────────────────
--port N # HTTP port (default 8080)
--host ADDR # Bind address (127.0.0.1 for local only)
--parallel N # Concurrent request slots
--cont-batching # Enable continuous batching
Troubleshooting
CUDA error: no kernel image is available for execution on the device
Cause: The GGML CUDA kernels were compiled for a different GPU architecture. Fix:
# Specify your GPU's compute capability during compilation
cmake -B build -DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="89" # RTX 4090 = 89, RTX 3080 = 86, RTX 3060 = 86
cmake --build build --config Release -j$(nproc)
Find your GPU’s compute capability at: developer.nvidia.com/cuda-gpus
Model loads but all layers are on CPU despite --n-gpu-layers 99
Cause: The build doesn’t include CUDA support. Fix:
./build/bin/llama-cli --version 2>&1 | grep CUDA
# If no output: recompile with -DGGML_CUDA=ON
llama-server: error: Failed to bind
Cause: Port 8080 is already in use. Fix:
sudo lsof -i :8080 # Find what's using it
# Change port: --port 8081
Conclusion
llama.cpp gives you the maximum possible control over local LLM inference: every quantization option, every sampling parameter, custom model conversion, speculative decoding configuration, and multi-GPU splitting. The OpenAI-compatible API server drops in as a replacement for cloud APIs with zero code changes. All inference runs on your hardware with zero external connections.
Together with GGUF Quantization Explained (choosing the right format) and Speculative Decoding (doubling throughput), this article completes the llama.cpp trilogy — everything you need for sovereign production-grade local AI inference.
People Also Ask
Is llama.cpp faster than Ollama?
llama.cpp and Ollama use the same inference engine — Ollama wraps llama.cpp. The raw token generation speed is identical for the same flags. Where llama.cpp is “faster” in practice: you can tune batch size, enable continuous batching, and configure speculative decoding with more precision. Where Ollama wins: model management, automatic hardware detection, and the simpler API. For peak throughput on a single model, llama.cpp with tuned flags outperforms default Ollama by 10–20%.
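That 10–20% figure is easy to verify on your own hardware. A rough A/B sketch with the OpenAI SDK against both servers; it assumes llama-server is on port 8080, Ollama's OpenAI-compatible endpoint is on its default port 11434, and a comparable model (the qwen3:8b tag is only an example) has already been pulled in Ollama:
# Quick A/B: measured generation throughput of two OpenAI-compatible endpoints
import time
from openai import OpenAI

endpoints = {
    "llama-server": ("http://localhost:8080/v1", "any-name"),
    "ollama":       ("http://localhost:11434/v1", "qwen3:8b"),   # example Ollama tag
}
prompt = [{"role": "user", "content": "Explain TCP slow start in 200 words."}]

for name, (base_url, model) in endpoints.items():
    client = OpenAI(base_url=base_url, api_key="not-needed")
    start = time.time()
    resp = client.chat.completions.create(model=model, messages=prompt, max_tokens=300)
    elapsed = time.time() - start
    print(f"{name:12s} {resp.usage.completion_tokens / elapsed:5.1f} tok/s")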
Can llama.cpp run on AMD GPUs?
Yes — compile with -DGGML_HIPBLAS=ON and CC=hipcc CXX=hipcc on ROCm-enabled systems. AMD RDNA 3 (RX 7900 XT/XTX) and CDNA 2/3 (MI200/MI300 series) are well-supported. Performance approaches NVIDIA parity on RDNA 3 for most models. The ROCm stack adds complexity — follow AMD’s ROCm installation guide for your Linux distribution before compiling llama.cpp.
How do I run multiple models simultaneously?
Start multiple llama-server instances on different ports:
./build/bin/llama-server --model model1.gguf --port 8080 &
./build/bin/llama-server --model model2.gguf --port 8081 &
Then use a reverse proxy (Nginx) or load balancer to route requests. VRAM is split between instances — both models must fit simultaneously. Alternatively, use Ollama’s OLLAMA_MAX_LOADED_MODELS=2 which handles model switching automatically.
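If you would rather skip the reverse proxy, the simplest router is client-side: one OpenAI client per port, picked by task. A sketch following the two-server example above (the task names are arbitrary):
# Client-side routing across two llama-server instances (ports from the example above)
from openai import OpenAI

clients = {
    "general": OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed"),
    "coder":   OpenAI(base_url="http://localhost:8081/v1", api_key="not-needed"),
}

def ask(task: str, prompt: str) -> str:
    client = clients.get(task, clients["general"])
    resp = client.chat.completions.create(
        model="any-name",        # llama-server ignores the model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    return resp.choices[0].message.content

print(ask("coder", "Write a one-line Python list comprehension that squares 1 to 10."))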
Further Reading
- GGUF Quantization Explained: Q4_K_M vs Q8_0 vs F16 — choose the right model format
- Speculative Decoding: 2x Faster Local LLMs — double your throughput
- How to Install Ollama and Run LLMs Locally — the simplified wrapper for most use cases
- Build a Sovereign Local AI Stack — integrate llama-server into a full Docker stack
- llama.cpp GitHub (73K+ stars) — source code and release notes
Tested on: Ubuntu 24.04 LTS (NVIDIA RTX 4090 24GB), Ubuntu 24.04 LTS (CPU-only i7-13700K 32GB), macOS Sequoia 15.4 (Apple M3 Max 64GB). llama.cpp build b4800. Last verified: April 17, 2026.