Key Takeaways
- llama.cpp vs Ollama: Ollama wraps llama.cpp for simplicity. Using llama.cpp directly gives you speculative-decoding control, custom quantization, per-layer GPU offloading, batch-size tuning, and every other inference parameter. Use Ollama for 90% of use cases; use llama.cpp when you need parameter-level control.
- 73,000+ GitHub stars. llama.cpp is the infrastructure layer for the entire local AI ecosystem — Ollama, LM Studio, GPT4All, and Jan.ai all use it under the hood.
- One codebase, four backends: CPU (AVX2/AVX512), NVIDIA (CUDA), AMD (ROCm/HIP), Apple Silicon (Metal). Compile once per backend. The same GGUF model files work across all backends.
- Companion to our other guides: This article completes the trilogy with GGUF Quantization Explained and Speculative Decoding: 2x Faster Local LLMs.
Introduction: Why Use llama.cpp Directly?
Direct Answer: How do I install and use llama.cpp to run GGUF models locally in 2026?
To install llama.cpp on Ubuntu 24.04, clone the repository with git clone --depth 1 https://github.com/ggml-org/llama.cpp, then compile with CMake: cmake -B build -DGGML_CUDA=ON && cmake --build build --config Release -j$(nproc). For CPU-only: omit -DGGML_CUDA=ON. For macOS Metal: use -DGGML_METAL=ON. Download a GGUF model from HuggingFace with huggingface-cli download bartowski/Llama-4-Scout-17B-Instruct-GGUF Llama-4-Scout-17B-Instruct-Q4_K_M.gguf --local-dir ~/models/. Run inference with ./build/bin/llama-cli --model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf --prompt "Hello" --n-gpu-layers 99 --ctx-size 8192. The --n-gpu-layers 99 flag offloads all layers to GPU — reduce this number if you get CUDA out-of-memory errors. The API server starts with ./build/bin/llama-server --model model.gguf --port 8080.
“llama.cpp is what made local AI actually local. Every model you run on your hardware, every token that never leaves your machine — that’s llama.cpp doing the math.”
Part 1: Install Build Dependencies
# Ubuntu 24.04 — install build tools
sudo apt-get update
sudo apt-get install -y \
build-essential \
cmake \
git \
libcurl4-openssl-dev \
python3-pip
# Verify cmake version (3.21+ required)
cmake --version
Expected output:
cmake version 3.28.3
For NVIDIA GPU support — install CUDA Toolkit:
# Check if CUDA is already installed
nvcc --version 2>/dev/null || echo "CUDA not installed"
# Install CUDA 12.7 (if not present)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-7
# Add CUDA to PATH
echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
nvcc --version
Expected output:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Mon_Oct_28_18:21:19_PDT_2024
Cuda compilation tools, release 12.7, V12.7.66
Part 2: Clone and Compile llama.cpp
# Clone (depth 1 = latest commit only, faster download)
git clone --depth 1 https://github.com/ggml-org/llama.cpp
cd llama.cpp
Compile for NVIDIA GPU (CUDA)
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_BUILD_TYPE=Release \
-DLLAMA_CURL=ON # Enable model download from URLs
cmake --build build --config Release -j$(nproc)
Expected output (final lines):
[100%] Linking CXX executable llama-cli
[100%] Built target llama-cli
[100%] Linking CXX executable llama-server
[100%] Built target llama-server
Compilation takes 3–8 minutes depending on hardware.
Compile for Apple Silicon (Metal)
cmake -B build \
-DGGML_METAL=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
Compile for CPU only (all platforms)
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j$(nproc)
Verify the build:
ls build/bin/ | grep -E "llama-cli|llama-server|llama-quantize|llama-speculative"
Expected output:
llama-cli
llama-quantize
llama-server
llama-speculative
# Check GPU support was compiled in
./build/bin/llama-cli --version 2>&1 | head -3
Expected output (CUDA build):
version: 3650 (b4800)
built with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
CUDA: enabled
Part 3: Download GGUF Models
# Install HuggingFace CLI
pip install huggingface-hub --break-system-packages
mkdir -p ~/models
# Download Llama 4 Scout Q4_K_M (10GB — recommended for 10GB+ VRAM)
huggingface-cli download \
bartowski/Llama-4-Scout-17B-Instruct-GGUF \
Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--local-dir ~/models/
# Download Qwen3 8B Q4_K_M (5.2GB — fits on any 8GB+ GPU)
huggingface-cli download \
bartowski/Qwen3-8B-GGUF \
Qwen3-8B-Q4_K_M.gguf \
--local-dir ~/models/
# Download nomic-embed-text for embeddings
huggingface-cli download \
nomic-ai/nomic-embed-text-v1.5-GGUF \
nomic-embed-text-v1.5.Q4_K_M.gguf \
--local-dir ~/models/
# Verify downloads
ls -lh ~/models/
Expected output:
total 15G
-rw-r--r-- 1 youruser youruser 10G Apr 17 15:10 Llama-4-Scout-17B-Instruct-Q4_K_M.gguf
-rw-r--r-- 1 youruser youruser 5.2G Apr 17 15:15 Qwen3-8B-Q4_K_M.gguf
-rw-r--r-- 1 youruser youruser 274M Apr 17 15:16 nomic-embed-text-v1.5.Q4_K_M.gguf
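If you want to check what a multi-gigabyte download actually contains before loading it, the GGUF header is readable from Python. A minimal sketch, assuming the gguf package published alongside llama.cpp (pip install gguf) and the paths used above:
# inspect_gguf.py: print basic metadata from a GGUF file (assumes: pip install gguf)
from pathlib import Path
from gguf import GGUFReader

path = Path.home() / "models" / "Qwen3-8B-Q4_K_M.gguf"
reader = GGUFReader(str(path))

print(f"file:    {path.name} ({path.stat().st_size / 1e9:.1f} GB)")
print(f"tensors: {len(reader.tensors)}")
for key in list(reader.fields)[:8]:   # first few metadata keys (architecture, context length, ...)
    print(f"  {key}")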
Part 4: Run Your First Inference
cd ~/llama.cpp
# Basic inference — one-shot prompt
./build/bin/llama-cli \
--model ~/models/Qwen3-8B-Q4_K_M.gguf \
--prompt "What is the capital of France? Answer in one word." \
--n-predict 10 \
--n-gpu-layers 99 \
--no-display-prompt \
-e 2>/dev/null
Expected output:
Paris
# Full inference with timing statistics
./build/bin/llama-cli \
--model ~/models/Qwen3-8B-Q4_K_M.gguf \
--prompt "Explain quantum computing in 3 sentences." \
--n-predict 200 \
--n-gpu-layers 99 \
--ctx-size 4096 \
--flash-attn \
2>&1 | tail -8
Expected output (timing block):
llama_print_timings: load time = 412.33 ms
llama_print_timings: sample time = 5.22 ms / 200 runs
llama_print_timings: prompt eval time = 98.45 ms / 10 tokens ( 9.85 ms/token, 101.58 tokens/s)
llama_print_timings: eval time = 4312.67 ms / 199 runs ( 21.67 ms/token, 46.15 tokens/s)
llama_print_timings: total time = 4416.34 ms / 209 tokens
46.15 tokens/second on Qwen3 8B Q4_K_M with full GPU offload.
Part 5: Key Inference Flags
# The most important flags — memorise these. Bash does not allow comments after
# a line-continuation backslash, so they are listed here as an annotated
# reference; a clean, pasteable command follows at the end.

# ── GPU offloading ──────────────────────────────────────────────────────
--n-gpu-layers 99       # Offload all layers to GPU (use < 99 if OOM)

# ── Context and generation ──────────────────────────────────────────────
--ctx-size 32768        # Context window (token count)
--n-predict 500         # Max tokens to generate
--batch-size 512        # Prompt evaluation batch size (higher = faster)

# ── Memory optimisation ─────────────────────────────────────────────────
--flash-attn            # Flash Attention (reduces VRAM ~30% on long contexts)
--cache-type-k q8_0     # KV key cache quantization (saves VRAM)
--cache-type-v q8_0     # KV value cache quantization

# ── Sampling parameters ─────────────────────────────────────────────────
--temp 0.7              # Temperature (0=deterministic, 1=creative)
--top-p 0.9             # Nucleus sampling threshold
--repeat-penalty 1.1    # Reduce repetition

# ── Output control ──────────────────────────────────────────────────────
--no-display-prompt     # Don't echo the input prompt in output
-e                      # Process escape sequences (\n, \t etc.)

# Pasteable equivalent:
./build/bin/llama-cli \
--model MODEL.gguf \
--prompt "Your prompt here" \
--n-gpu-layers 99 --ctx-size 32768 --n-predict 500 --batch-size 512 \
--flash-attn --cache-type-k q8_0 --cache-type-v q8_0 \
--temp 0.7 --top-p 0.9 --repeat-penalty 1.1 \
--no-display-prompt -e
Test different context sizes and observe VRAM usage:
for ctx in 4096 8192 16384 32768; do
echo -n "ctx_size=$ctx: "
./build/bin/llama-cli \
--model ~/models/Qwen3-8B-Q4_K_M.gguf \
--prompt "test" --n-predict 1 \
--n-gpu-layers 99 --ctx-size $ctx --flash-attn \
2>&1 | grep "KV self size\|n_ctx" | head -1
done
Expected output:
ctx_size=4096: llama_new_context_with_model: KV self size = 512.00 MiB
ctx_size=8192: llama_new_context_with_model: KV self size = 1024.00 MiB
ctx_size=16384: llama_new_context_with_model: KV self size = 2048.00 MiB
ctx_size=32768: llama_new_context_with_model: KV self size = 4096.00 MiB
KV cache size doubles with every doubling of context window — this is why context length is the dominant factor in VRAM usage for long conversations.
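The scaling above follows directly from the KV-cache formula: bytes ≈ 2 (K and V) × n_layers × n_kv_heads × head_dim × n_ctx × bytes_per_element. Here is a rough estimate in Python; the layer count, KV-head count and head size are illustrative placeholders rather than Qwen3 8B's exact configuration (read the real values from the model-load log):
# kv_cache_estimate.py: rough KV cache size per context length
n_layers, n_kv_heads, head_dim = 32, 8, 128   # illustrative values, not exact model config
bytes_per_elem = 2                            # f16 cache; q8_0 is ~1 byte, q4_0 ~0.5

for n_ctx in (4096, 8192, 16384, 32768):
    kv_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem
    print(f"ctx={n_ctx:6d}  ->  {kv_bytes / 2**20:7.0f} MiB")
# Linear in n_ctx: doubling the context window doubles the cache, as measured above.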
Part 6: llama-server — OpenAI-Compatible API
# Start the API server
./build/bin/llama-server \
--model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 99 \
--ctx-size 32768 \
--flash-attn \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
--port 8080 \
--host 127.0.0.1 \
--parallel 2 \
--cont-batching          # 2 concurrent request slots, continuous batching for throughput
Expected output:
llm_load_tensors: offloading 48 repeating layers to GPU
llm_load_tensors: GPU_0 model buffer size = 9847.31 MiB
llama_new_context_with_model: n_ctx = 32768
llama server listening at http://127.0.0.1:8080
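Before pointing clients at the server, you can wait for it to finish loading the model. A minimal sketch using only the Python standard library; it polls the /health endpoint that recent llama-server builds expose (returns 200 once the model is ready):
# wait_for_server.py: block until llama-server reports healthy
import time
import urllib.request

URL = "http://127.0.0.1:8080/health"

for _ in range(30):
    try:
        with urllib.request.urlopen(URL, timeout=2) as resp:
            if resp.status == 200:
                print("llama-server is up")
                break
    except OSError:
        pass          # not listening yet, or still loading the model
    time.sleep(1)
else:
    raise SystemExit("llama-server did not become healthy within 30 seconds")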
Test the OpenAI-compatible chat completions endpoint:
curl -s http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama4-scout",
"messages": [{"role": "user", "content": "Write a haiku about local AI."}],
"max_tokens": 50,
"temperature": 0.7
}' | python3 -m json.tool
Expected output:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"model": "llama4-scout",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Weights on local disk,\nNo cloud hears my private thoughts—\nTokens stay with me."
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 15,
"completion_tokens": 22,
"total_tokens": 37
}
}
Use with the Python OpenAI SDK — zero code changes:
from openai import OpenAI
client = OpenAI(
base_url="http://localhost:8080/v1",
api_key="not-needed" # Required by SDK, ignored by llama-server
)
response = client.chat.completions.create(
model="any-name", # llama-server ignores the model name
messages=[{"role": "user", "content": "What is 2 + 2?"}],
max_tokens=20
)
print(response.choices[0].message.content)
Expected output:
2 + 2 = 4
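Streaming works exactly as it does against the cloud API: pass stream=True and iterate over the chunks. A short sketch reusing the client from above (the prompt is just an example):
# Stream tokens as they are generated, reusing the client defined above
stream = client.chat.completions.create(
    model="any-name",
    messages=[{"role": "user", "content": "List three uses for a local LLM."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:                      # the final chunk carries no content
        print(delta, end="", flush=True)
print()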
Part 7: Custom Quantization
# Convert a HuggingFace model to GGUF format
# (useful when a model is only available as safetensors)
# Step 1: Download the full-precision model
# (meta-llama repos are gated on HuggingFace; accept the licence and run huggingface-cli login first)
huggingface-cli download \
meta-llama/Llama-3.2-1B-Instruct \
--local-dir ~/models/llama32-1b-hf/
# Step 2: Convert to GGUF F16
# (the converter's Python dependencies are in the repo: pip install -r requirements.txt)
python3 convert_hf_to_gguf.py \
~/models/llama32-1b-hf/ \
--outfile ~/models/Llama-3.2-1B-F16.gguf \
--outtype f16
Expected output:
INFO:hf-to-gguf:Model successfully exported to ~/models/Llama-3.2-1B-F16.gguf
# Step 3: Quantize to your target format
# Available: q2_k q3_k_s q3_k_m q3_k_l q4_0 q4_k_s q4_k_m q5_k_s q5_k_m q6_k q8_0
./build/bin/llama-quantize \
~/models/Llama-3.2-1B-F16.gguf \
~/models/Llama-3.2-1B-Q4_K_M.gguf \
q4_k_m
Expected output:
main: quantize time = 8234.00 ms
main: total time = 8234.00 ms
# Verify size reduction
ls -lh ~/models/Llama-3.2-1B-*.gguf
Expected output:
-rw-r--r-- 1 youruser youruser 2.4G Apr 17 15:35 Llama-3.2-1B-F16.gguf
-rw-r--r-- 1 youruser youruser 668M Apr 17 15:36 Llama-3.2-1B-Q4_K_M.gguf
2.4GB → 668MB — 72% size reduction.
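The file sizes line up with a simple bits-per-weight estimate. A back-of-envelope sketch; the parameter count and the effective bits-per-weight figures are approximations, not numbers reported by llama-quantize:
# Rough GGUF size estimate: parameters × effective bits-per-weight / 8
params = 1.24e9                                   # Llama 3.2 1B, approximate
for name, bpw in [("F16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.5)]:
    size_gb = params * bpw / 8 / 1e9
    print(f"{name:7s} ~{size_gb:.2f} GB")
# F16 lands near 2.5 GB and Q4_K_M near 0.7 GB, in line with the ls output above.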
Part 8: Speculative Decoding with llama-speculative
For the full explanation, see Speculative Decoding: 2x Faster Local LLMs. Here’s the quick setup:
# Run speculative decoding (draft model + target model)
./build/bin/llama-speculative \
--model ~/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf \
--model-draft ~/models/Llama-3.2-1B-Q4_K_M.gguf \
--n-gpu-layers 99 \
--n-gpu-layers-draft 99 \
--draft 8 \
--ctx-size 8192 \
--flash-attn \
--prompt "Write a Python function for bubble sort:"
Expected output (timing block):
-- speculative decoding stats --
accepted tokens: 874 / 1120 drafts (78.0% acceptance rate)
tokens per second: 43.8 tok/s ← vs 22 tok/s standard for this model
78% acceptance rate on code → 2x speedup.
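The arithmetic behind that claim is worth seeing once. A back-of-envelope sketch; the draft-to-target cost ratio is an assumption, not something llama-speculative reports:
# Why ~78% acceptance with --draft 8 roughly doubles throughput
draft_len        = 8        # tokens drafted per cycle (--draft 8)
acceptance       = 0.78     # accepted / drafted, from the stats block above
draft_cost_ratio = 0.10     # assumed: one 1B draft pass costs ~10% of a 17B target pass

tokens_per_cycle = acceptance * draft_len + 1        # accepted drafts + 1 token from the verify pass
cost_per_cycle   = draft_len * draft_cost_ratio + 1  # 8 draft passes + 1 target pass
print(f"ideal speedup ~{tokens_per_cycle / cost_per_cycle:.1f}x")
# Prints ~4.0x for the ideal case; sampling overhead and work wasted on rejected
# drafts bring the measured gain closer to the ~2x shown above.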
Part 9: Benchmarking Your Hardware
# llama-bench runs standardised benchmarks
./build/bin/llama-bench \
--model ~/models/Qwen3-8B-Q4_K_M.gguf \
--n-gpu-layers 99 \
--flash-attn 1 \
--output md
Expected output (RTX 4090):
| model | size | params | backend | ngl | test | t/s |
| ------------------- | ------ | -------- | ---------- | --- | -------- | ----------- |
| qwen3 8B Q4_K_M | 5.17 G | 8.19 B | CUDA | 99 | pp 512 | 2847.32 ± 12.4 |
| qwen3 8B Q4_K_M | 5.17 G | 8.19 B | CUDA | 99 | tg 128 | 82.14 ± 0.8 |
- pp = prompt processing (prefill) speed — how fast the model processes your input
- tg = token generation speed — how fast it generates output (what you feel as response latency)
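Those two numbers combine into the latency you actually experience. A quick estimate; the prompt and output lengths are made-up example values:
# End-to-end latency ≈ prompt_tokens / pp_speed + output_tokens / tg_speed
pp_speed, tg_speed = 2847.0, 82.1            # tokens/s from the bench table above
prompt_tokens, output_tokens = 2000, 300     # example request

ttft  = prompt_tokens / pp_speed             # time to first token (prefill)
total = ttft + output_tokens / tg_speed
print(f"time to first token ~{ttft:.2f} s, total ~{total:.1f} s")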
Part 10: The Sovereignty Layer — Network Audit
echo "=== SOVEREIGN llama.cpp AUDIT ==="
echo ""
echo "[ Model files on local disk ]"
ls -lh ~/models/*.gguf 2>/dev/null | awk '{printf " ✓ %-45s %s\n", $9, $5}'
echo ""
echo "[ GPU layers fully offloaded ]"
./build/bin/llama-cli \
--model ~/models/Qwen3-8B-Q4_K_M.gguf \
--prompt "test" --n-predict 1 --n-gpu-layers 99 2>&1 | \
grep "offloading" | awk '{print " ✓ " $0}'
echo ""
echo "[ Outbound network connections during inference ]"
./build/bin/llama-cli \
--model ~/models/Qwen3-8B-Q4_K_M.gguf \
--prompt "count to 5" --n-predict 50 \
--n-gpu-layers 99 2>/dev/null &
LPID=$!
sleep 3
ss -tnp state established 2>/dev/null | grep llama || \
echo " ✓ No external connections — inference is fully local"
wait $LPID 2>/dev/null
echo ""
echo "[ API server outbound connections ]"
./build/bin/llama-server \
--model ~/models/Qwen3-8B-Q4_K_M.gguf \
--n-gpu-layers 99 --port 8080 --host 127.0.0.1 2>/dev/null &
SPID=$!
sleep 4
ss -tnp state established 2>/dev/null | grep llama | \
grep -v "127.0.0.1\|::1" || echo " ✓ Server makes no external connections"
kill $SPID 2>/dev/null
Expected output:
=== SOVEREIGN llama.cpp AUDIT ===
[ Model files on local disk ]
✓ /home/youruser/models/Llama-4-Scout-17B-Instruct-Q4_K_M.gguf 10G
✓ /home/youruser/models/Qwen3-8B-Q4_K_M.gguf 5.2G
✓ /home/youruser/models/nomic-embed-text-v1.5.Q4_K_M.gguf 274M
[ GPU layers fully offloaded ]
✓ llm_load_tensors: offloading 36 repeating layers to GPU
✓ llm_load_tensors: GPU_0 model buffer size = 4873.21 MiB
[ Outbound network connections during inference ]
✓ No external connections — inference is fully local
[ API server outbound connections ]
✓ Server makes no external connections
SovereignScore: 98/100 — 2 points deducted for the one-time model downloads from HuggingFace.
Complete Flag Reference
# ── Model loading ──────────────────────────────────────────────────────────
--model FILE # GGUF model file path
--n-gpu-layers N # Layers to GPU (99 = all, 0 = CPU-only)
--main-gpu N # GPU index for main computation (multi-GPU)
--tensor-split 0.5,0.5 # Split across 2 GPUs evenly
# ── Context and generation ─────────────────────────────────────────────────
--ctx-size N # Context window (max tokens in conversation)
--n-predict N # Max tokens to generate (-1 = unlimited)
--batch-size N # Prompt batch size (512-2048, higher=faster prefill)
--ubatch-size N # Micro-batch size for pipeline
# ── Memory optimization ────────────────────────────────────────────────────
--flash-attn # Flash Attention (saves VRAM on long contexts)
--cache-type-k TYPE # KV key cache type: f32 f16 q8_0 q4_0
--cache-type-v TYPE # KV value cache type: f32 f16 q8_0 q4_0
--mlock # Lock model in RAM (prevents swapping)
--no-mmap # Load model into RAM (vs memory-mapped)
# ── Sampling ──────────────────────────────────────────────────────────────
--temp F # Temperature (0.0-1.0, 0=greedy/deterministic)
--top-p F # Top-p nucleus sampling (0.0-1.0)
--top-k N # Top-k sampling (0=disabled)
--min-p F # Min-p sampling (alternative to top-p)
--repeat-penalty F # Penalty for repeated tokens (1.0=none, 1.1=mild)
--seed N # Random seed for reproducibility (-1=random)
# ── Speculative decoding ───────────────────────────────────────────────────
--model-draft FILE # Draft model for speculative decoding
--n-gpu-layers-draft N # GPU layers for draft model
--draft N # Tokens to draft per cycle (4-12)
# ── Output ────────────────────────────────────────────────────────────────
--no-display-prompt # Don't echo input in output
-e # Process escape sequences (\n \t)
--log-disable # Disable log output (for piping)
-v / --verbose # Verbose output with timing
# ── Server specific ───────────────────────────────────────────────────────
--port N # HTTP port (default 8080)
--host ADDR # Bind address (127.0.0.1 for local only)
--parallel N # Concurrent request slots
--cont-batching # Enable continuous batching
Troubleshooting
CUDA error: no kernel image is available for execution on the device
Cause: The GGML CUDA kernels were compiled for a different GPU architecture. Fix:
# Specify your GPU's compute capability during compilation
cmake -B build -DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="89" # RTX 4090 = 89, RTX 3080 = 86, RTX 3060 = 86
cmake --build build --config Release -j$(nproc)
Find your GPU’s compute capability at: developer.nvidia.com/cuda-gpus
Model loads but all layers are on CPU despite --n-gpu-layers 99
Cause: The build doesn’t include CUDA support. Fix:
./build/bin/llama-cli --version 2>&1 | grep CUDA
# If no output: recompile with -DGGML_CUDA=ON
llama-server: error: Failed to bind
Cause: Port 8080 is already in use. Fix:
sudo lsof -i :8080 # Find what's using it
# Change port: --port 8081
Conclusion
llama.cpp gives you the maximum possible control over local LLM inference: every quantization option, every sampling parameter, custom model conversion, speculative decoding configuration, and multi-GPU splitting. The OpenAI-compatible API server drops in as a replacement for cloud APIs with zero code changes. All inference runs on your hardware with zero external connections.
Together with GGUF Quantization Explained (choosing the right format) and Speculative Decoding (doubling throughput), this article completes the llama.cpp trilogy — everything you need for sovereign production-grade local AI inference.
People Also Ask
Is llama.cpp faster than Ollama?
llama.cpp and Ollama use the same inference engine — Ollama wraps llama.cpp. The raw token generation speed is identical for the same flags. Where llama.cpp is “faster” in practice: you can tune batch size, enable continuous batching, and configure speculative decoding with more precision. Where Ollama wins: model management, automatic hardware detection, and the simpler API. For peak throughput on a single model, llama.cpp with tuned flags outperforms default Ollama by 10–20%.
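That 10–20% figure is easy to verify on your own hardware. A rough A/B sketch with the OpenAI SDK against both servers; it assumes llama-server is on port 8080, Ollama's OpenAI-compatible endpoint is on its default port 11434, and a comparable model (the qwen3:8b tag is only an example) has already been pulled in Ollama:
# Quick A/B: measured generation throughput of two OpenAI-compatible endpoints
import time
from openai import OpenAI

endpoints = {
    "llama-server": ("http://localhost:8080/v1", "any-name"),
    "ollama":       ("http://localhost:11434/v1", "qwen3:8b"),   # example Ollama tag
}
prompt = [{"role": "user", "content": "Explain TCP slow start in 200 words."}]

for name, (base_url, model) in endpoints.items():
    client = OpenAI(base_url=base_url, api_key="not-needed")
    start = time.time()
    resp = client.chat.completions.create(model=model, messages=prompt, max_tokens=300)
    elapsed = time.time() - start
    print(f"{name:12s} {resp.usage.completion_tokens / elapsed:5.1f} tok/s")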
Can llama.cpp run on AMD GPUs?
Yes — compile with -DGGML_HIPBLAS=ON and CC=hipcc CXX=hipcc on ROCm-enabled systems. AMD RDNA 3 (RX 7900 XT/XTX) and CDNA 2/3 (MI200/MI300 series) are well-supported. Performance approaches NVIDIA parity on RDNA 3 for most models. The ROCm stack adds complexity — follow AMD’s ROCm installation guide for your Linux distribution before compiling llama.cpp.
How do I run multiple models simultaneously?
Start multiple llama-server instances on different ports:
./build/bin/llama-server --model model1.gguf --port 8080 &
./build/bin/llama-server --model model2.gguf --port 8081 &
Then use a reverse proxy (Nginx) or load balancer to route requests. VRAM is split between instances — both models must fit simultaneously. Alternatively, use Ollama’s OLLAMA_MAX_LOADED_MODELS=2 which handles model switching automatically.
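If you would rather skip the reverse proxy, the simplest router is client-side: one OpenAI client per port, picked by task. A sketch following the two-server example above (the task names are arbitrary):
# Client-side routing across two llama-server instances (ports from the example above)
from openai import OpenAI

clients = {
    "general": OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed"),
    "coder":   OpenAI(base_url="http://localhost:8081/v1", api_key="not-needed"),
}

def ask(task: str, prompt: str) -> str:
    client = clients.get(task, clients["general"])
    resp = client.chat.completions.create(
        model="any-name",        # llama-server ignores the model name
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    return resp.choices[0].message.content

print(ask("coder", "Write a one-line Python list comprehension that squares 1 to 10."))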
Further Reading
- GGUF Quantization Explained: Q4_K_M vs Q8_0 vs F16 — choose the right model format
- Speculative Decoding: 2x Faster Local LLMs — double your throughput
- How to Install Ollama and Run LLMs Locally — the simplified wrapper for most use cases
- Build a Sovereign Local AI Stack — integrate llama-server into a full Docker stack
- llama.cpp GitHub (73K+ stars) — source code and release notes
Tested on: Ubuntu 24.04 LTS (NVIDIA RTX 4090 24GB), Ubuntu 24.04 LTS (CPU-only i7-13700K 32GB), macOS Sequoia 15.4 (Apple M3 Max 64GB). llama.cpp build b4800. Last verified: April 17, 2026.