AI & Machine Learning Guide

How to Run LLMs on Mac Mini M4 (Llama, Mistral, Phi)

Apple Silicon's unified memory architecture makes Mac Mini M4 one of the most cost-effective platforms for running large language models locally. This guide covers Ollama, llama.cpp, and MLX with real benchmarks and practical code examples.

20 min read · Updated January 2025 · Intermediate

1. Why Mac Mini M4 for LLMs?

The Mac Mini M4 is uniquely suited for running large language models thanks to Apple Silicon's architecture. Unlike traditional GPU servers where VRAM limits model size, the M4's unified memory allows the CPU, GPU, and Neural Engine to share the same memory pool -- meaning a 24GB Mac Mini can load models that would require an expensive GPU with 24GB VRAM.

Unified Memory Architecture

Unlike NVIDIA GPUs with separate VRAM, the M4's unified memory lets the GPU address the full system RAM. A 24GB Mac Mini can devote most of that 24GB to model weights (macOS reserves a share for the system), with no PCIe transfer bottleneck between CPU and GPU.

Neural Engine

The M4's 16-core Neural Engine delivers up to 38 TOPS of ML performance. Core ML can target it for on-device inference, while LLM frameworks such as MLX and llama.cpp run the matrix operations at the heart of transformer inference on the GPU through Metal.

Power Efficiency

The Mac Mini M4 consumes just 5-15W under typical LLM inference load, compared to 300-450W for an NVIDIA A100. This translates to dramatically lower hosting costs and no specialized cooling required.

Cost Effective

Starting at $75/mo for a dedicated Mac Mini M4 with 16GB, you get predictable pricing with no per-token API costs. Run unlimited inference requests 24/7 at a fraction of GPU cloud pricing.

Key Insight: For inference workloads (not training), the Mac Mini M4 offers one of the best performance-per-dollar ratios available. You get dedicated hardware with no noisy neighbors, no per-token billing, and memory bandwidth of 120 GB/s on the base M4 (273 GB/s on the M4 Pro).

2. LLM Frameworks Comparison

Three frameworks dominate the Apple Silicon LLM ecosystem. Each has distinct strengths depending on your use case.

| Feature | Ollama | llama.cpp | MLX |
|---|---|---|---|
| Ease of Setup | Very Easy | Moderate | Easy |
| Metal GPU Support | Yes (auto) | Yes (flag) | Yes (native) |
| API Server | Built-in | Built-in | Manual |
| Model Format | GGUF (auto-download) | GGUF | SafeTensors / MLX |
| Performance | Good | Best for GGUF | Best for Apple Silicon |
| Model Library | Curated (ollama.com) | HuggingFace GGUF | HuggingFace MLX |
| Language | Go (CLI/API) | C++ (CLI/API) | Python |
| Best For | Quick deployment, API serving | Max control, custom builds | Python ML pipelines, research |

3. Setup with Ollama

Ollama is the easiest way to get started with LLMs on Mac Mini M4. It handles model downloading, quantization, and API serving with a single binary.

Step 1: Install Ollama

# Install Ollama on macOS via Homebrew (the install.sh script targets Linux)
brew install ollama

# Or download the macOS app from https://ollama.com/download

# Verify installation
ollama --version
# ollama version 0.5.4

Step 2: Download Models

# Download Llama 3 8B (4.7GB, fits 16GB RAM)
ollama pull llama3:8b

# Download Mistral 7B (4.1GB)
ollama pull mistral:7b

# Download Phi-3 Mini (2.3GB, great for constrained setups)
ollama pull phi3:mini

# Download Llama 3 70B (requires 48GB+ RAM)
ollama pull llama3:70b

# List downloaded models
ollama list
# NAME            SIZE     MODIFIED
# llama3:8b       4.7 GB   2 minutes ago
# mistral:7b      4.1 GB   5 minutes ago
# phi3:mini       2.3 GB   8 minutes ago

Step 3: Run Interactive Chat

# Start an interactive chat session
ollama run llama3:8b

# Example interaction:
# >>> What is the capital of France?
# The capital of France is Paris. It is the largest city in France
# and serves as the country's political, economic, and cultural center.

Step 4: Serve as an API

Ollama automatically starts a REST API server on port 11434. You can query it from any application using the OpenAI-compatible API.

# The Ollama server starts automatically, listening on localhost:11434

# Query using curl (OpenAI-compatible endpoint)
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3:8b",
    "messages": [
      {"role": "system", "content": "You are a helpful coding assistant."},
      {"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}
    ],
    "temperature": 0.7,
    "max_tokens": 500
  }'

# Native Ollama API endpoint
curl http://localhost:11434/api/generate \
  -d '{
    "model": "llama3:8b",
    "prompt": "Explain quantum computing in 3 sentences.",
    "stream": false
  }'

Step 5: Use from Python

# pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # Ollama doesn't require an API key
)

response = client.chat.completions.create(
    model="llama3:8b",
    messages=[
        {"role": "system", "content": "You are a senior Python developer."},
        {"role": "user", "content": "Write a FastAPI endpoint for user registration."}
    ],
    temperature=0.7,
    max_tokens=1000
)

print(response.choices[0].message.content)
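
For chat-style UIs you usually want tokens as they are generated rather than one blocking response. Below is a minimal streaming sketch against Ollama's native /api/chat endpoint using the requests library; the endpoint and field names follow Ollama's API documentation, and the model tag assumes you have already pulled llama3:8b.

# pip install requests
import json
import requests

# Stream a chat completion; Ollama returns newline-delimited JSON chunks
with requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3:8b",
        "messages": [{"role": "user", "content": "Explain Docker networking briefly."}],
        "stream": True,
    },
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        # Each chunk carries a partial assistant message until "done" is true
        print(chunk.get("message", {}).get("content", ""), end="", flush=True)
        if chunk.get("done"):
            print()
            break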

Step 6: Run Ollama as a Background Service

# Create a launchd plist for auto-start on boot
# (if you installed via Homebrew, `brew services start ollama` does the same job)
cat <<EOF > ~/Library/LaunchAgents/com.ollama.server.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.server</string>
    <key>ProgramArguments</key>
    <array>
        <string>$(which ollama)</string>
        <string>serve</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>
EOF

# Load the service
launchctl load ~/Library/LaunchAgents/com.ollama.server.plist

# Verify it's running
curl http://localhost:11434/api/tags

4. Setup with llama.cpp

llama.cpp gives you maximum control over inference parameters and often delivers the best raw performance on Apple Silicon thanks to its hand-optimized Metal backend.

Step 1: Clone and Build with Metal

# Install dependencies
brew install cmake

# Clone the repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Build with Metal GPU acceleration (on by default on Apple Silicon;
# newer llama.cpp builds use GGML_METAL in place of the older LLAMA_METAL flag)
mkdir build && cd build
cmake .. -DGGML_METAL=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --config Release -j$(sysctl -n hw.ncpu)

# Verify Metal support
./bin/llama-cli --help | grep metal

Step 2: Download GGUF Models

# Install huggingface-cli for easy downloads
pip install huggingface_hub

# Download Llama 3 8B Instruct Q4_K_M (best quality/speed balance)
huggingface-cli download \
  bartowski/Meta-Llama-3-8B-Instruct-GGUF \
  Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  --local-dir ./models

# Download Mistral 7B Q4_K_M
huggingface-cli download \
  TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --local-dir ./models

# Download Phi-3 Mini Q4
huggingface-cli download \
  microsoft/Phi-3-mini-4k-instruct-gguf \
  Phi-3-mini-4k-instruct-q4.gguf \
  --local-dir ./models

Step 3: Run Inference

# Run Llama 3 8B with Metal GPU offloading (all layers)
./build/bin/llama-cli \
  -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  -t 8 \
  --temp 0.7 \
  -p "Explain how transformers work in machine learning:"

# Key flags:
# -ngl 99     : Offload all layers to Metal GPU
# -c 4096     : Context window size
# -t 8        : Number of CPU threads (M4 has 10 cores)
# --temp 0.7  : Temperature for sampling

Step 4: Start the API Server

# Start OpenAI-compatible API server
./build/bin/llama-server \
  -m ./models/Meta-Llama-3-8B-Instruct-Q4_K_M.gguf \
  -ngl 99 \
  -c 4096 \
  --host 0.0.0.0 \
  --port 8080 \
  --parallel 4

# Test the API
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3-8b",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

5. Setup with MLX

MLX is Apple's own machine learning framework, designed specifically for Apple Silicon. It offers tight integration with the M4's GPU and unified memory through Metal, making it ideal for Python-based ML workflows.

Step 1: Install MLX

# Create a virtual environment
python3 -m venv ~/mlx-env
source ~/mlx-env/bin/activate

# Install MLX and the LLM package
pip install mlx mlx-lm

# Verify installation
python3 -c "import mlx.core as mx; print(mx.default_device())"
# Device(gpu, 0)

Step 2: Run Inference with MLX

# Run Llama 3 8B using mlx-lm CLI
mlx_lm.generate \
  --model mlx-community/Meta-Llama-3-8B-Instruct-4bit \
  --prompt "Write a Python decorator for rate limiting:" \
  --max-tokens 500 \
  --temp 0.7

# Run Mistral 7B
mlx_lm.generate \
  --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --prompt "Explain microservices architecture:" \
  --max-tokens 500

Step 3: Python Integration

from mlx_lm import load, generate

# Load the model (downloads on first run)
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

# Generate text
prompt = "Write a bash script to monitor disk usage and send alerts:"
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=500,
    temp=0.7,
    top_p=0.9
)
print(response)

# Streaming generation
from mlx_lm import stream_generate

for token in stream_generate(
    model, tokenizer,
    prompt="Explain Docker networking:",
    max_tokens=300
):
    print(token, end="", flush=True)

Step 4: Build a Simple API with MLX

# pip install fastapi uvicorn mlx-lm
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from mlx_lm import load, stream_generate
import json

app = FastAPI()
model, tokenizer = load("mlx-community/Meta-Llama-3-8B-Instruct-4bit")

@app.post("/v1/completions")
async def completions(request: dict):
    prompt = request.get("prompt", "")
    max_tokens = request.get("max_tokens", 256)

    response = ""
    for token in stream_generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens):
        response += token

    return {"choices": [{"text": response}]}

@app.post("/v1/chat/completions")
async def chat(request: dict):
    messages = request.get("messages", [])
    prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    response = ""
    for token in stream_generate(model, tokenizer, prompt=prompt, max_tokens=512):
        response += token

    return {
        "choices": [{
            "message": {"role": "assistant", "content": response}
        }]
    }

# Run: uvicorn server:app --host 0.0.0.0 --port 8000
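
To sanity-check the server above, call it from any HTTP client. A minimal sketch with the requests library, assuming the app is saved as server.py and running locally on port 8000 as in the comment above:

# pip install requests
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={"messages": [{"role": "user", "content": "What does unified memory mean for LLM inference?"}]},
    timeout=300,  # local generation of a few hundred tokens can take a while
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])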

6. Performance Benchmarks

Real-world benchmarks measured using Ollama with Q4_K_M quantization. All tests use 512 token prompt, 256 token generation, and default sampling parameters.

| Hardware | Model | Tokens/sec | Time to First Token | Cost/mo |
|---|---|---|---|---|
| Mac Mini M4 16GB | Llama 3 8B Q4 | ~35 tok/s | ~180ms | $75 |
| Mac Mini M4 16GB | Mistral 7B Q4 | ~38 tok/s | ~160ms | $75 |
| Mac Mini M4 24GB | Llama 2 13B Q4 | ~22 tok/s | ~320ms | $95 |
| Mac Mini M4 Pro 48GB | Llama 3 70B Q4 | ~12 tok/s | ~850ms | $179 |
| RTX 4090 (cloud) | Llama 3 8B Q4 | ~120 tok/s | ~50ms | $500+ |

Note: While NVIDIA GPUs offer higher raw throughput, Mac Mini M4 delivers excellent tokens/second for interactive use cases at a fraction of the cost. At 35 tok/s, responses feel instantaneous for chat applications. The real advantage is cost: $75/mo unlimited vs. pay-per-token API pricing that can easily exceed $500/mo.
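
Throughput varies with quantization, context length, and prompt size, so it is worth measuring on your own hardware. The sketch below reads the eval_count and eval_duration fields that Ollama's native /api/generate endpoint returns for non-streaming requests; adjust the model tag and prompt to match the benchmark conditions above.

# pip install requests
import requests

r = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3:8b",
        "prompt": "Explain how transformers work in machine learning.",
        "stream": False,
    },
    timeout=600,
)
stats = r.json()

# eval_count = generated tokens, eval_duration = generation time in nanoseconds
tok_per_sec = stats["eval_count"] / (stats["eval_duration"] / 1e9)
print(f"Generated {stats['eval_count']} tokens at {tok_per_sec:.1f} tok/s")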

7. Which Models Fit Which Config?

The key factor is unified memory. With Q4 quantization, models use roughly 0.5-0.6 GB per billion parameters, plus overhead for context and the OS.

| Memory | Model Size Range | Example Models | Price/mo |
|---|---|---|---|
| 16 GB | 7B - 13B (Q4) | Llama 3 8B, Mistral 7B, Phi-3 Mini, Gemma 7B | $75 |
| 24 GB | 13B - 34B (Q4) | Llama 2 13B, CodeLlama 34B, Yi 34B | $95 |
| 48 GB | 34B - 70B (Q4) | Llama 3 70B, Mixtral 8x7B, DeepSeek 67B | $179 |
| 64 GB+ | 70B+ (Q4/Q6) | Llama 3 70B Q6, Mixtral 8x22B, Command-R+ | $249+ |

# Quick formula to estimate memory requirements:
# Memory needed = (Parameters in B * Bits per weight / 8) + context overhead
#
# Example: Llama 3 70B at Q4 quantization
# = (70 * 4 / 8) GB = 35 GB model weights
# + ~4 GB context/overhead
# = ~39 GB total → fits in 48GB Mac Mini M4 Pro
#
# Check current memory usage while running a model:
ollama ps
# NAME          SIZE     PROCESSOR    UNTIL
# llama3:8b     5.1 GB   100% GPU     4 minutes from now
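
If you prefer to script the estimate, the same back-of-the-envelope formula fits in a few lines of Python. The flat 4 GB context allowance mirrors the example above and is an assumption, not a measured value; KV-cache usage grows with context length and model size.

def estimate_memory_gb(params_billion: float, bits_per_weight: float = 4,
                       context_overhead_gb: float = 4.0) -> float:
    """Rough unified-memory estimate for running a quantized model."""
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + context_overhead_gb

# Llama 3 70B at Q4 -> ~39 GB, fits a 48GB Mac Mini M4 Pro
print(f"Llama 3 70B @ Q4: ~{estimate_memory_gb(70):.0f} GB")

# Llama 3 8B at Q4 -> ~8 GB, comfortable on a 16GB machine
print(f"Llama 3 8B  @ Q4: ~{estimate_memory_gb(8):.0f} GB")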

8. Use Cases

Private AI Assistant

Run a ChatGPT-like assistant that keeps all data on your server. No data leaves your infrastructure. Perfect for handling sensitive documents, customer data, or proprietary code.

Recommended: Llama 3 8B on 16GB

RAG Pipeline

Build a Retrieval-Augmented Generation system that searches your documents and generates grounded answers. Use ChromaDB or Qdrant as the vector store and Ollama for generation (see the sketch at the end of this section).

Recommended: Mistral 7B on 16GB

Code Generation

Use specialized coding models like CodeLlama or DeepSeek Coder for autocomplete, code review, and automated refactoring. Integrate with VS Code or JetBrains via Continue.dev.

Recommended: CodeLlama 34B on 48GB

Content Generation

Generate marketing copy, blog posts, product descriptions, and email templates at scale. Run batch jobs overnight without per-token API costs adding up.

Recommended: Llama 3 70B on 48GB
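
To make the RAG use case above concrete, here is a minimal sketch that pairs ChromaDB (using its default local embedding function) with the Ollama OpenAI-compatible endpoint from section 3. The collection name and documents are placeholders; a production pipeline would add chunking, metadata filtering, and re-ranking.

# pip install chromadb openai
import chromadb
from openai import OpenAI

# 1. Index a few documents locally (ChromaDB embeds them with its default model)
chroma = chromadb.Client()
docs = chroma.create_collection("internal-docs")
docs.add(
    ids=["doc1", "doc2"],
    documents=[
        "Our VPN requires WireGuard and keys are rotated every 90 days.",
        "Production deploys run from the main branch via GitHub Actions.",
    ],
)

# 2. Retrieve the most relevant snippets for a question
question = "How often are VPN keys rotated?"
hits = docs.query(query_texts=[question], n_results=2)
context = "\n".join(hits["documents"][0])

# 3. Ask the local model, grounded in the retrieved context
llm = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
answer = llm.chat.completions.create(
    model="mistral:7b",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)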

9. Performance Tips

Choose the Right Quantization

Quantization level dramatically impacts both speed and quality. Q4_K_M offers the best balance for most use cases.

# Quantization levels (from fastest to best quality):
# Q2_K  - Fastest, lowest quality, smallest size
# Q3_K  - Fast, acceptable quality
# Q4_K_M - Best balance of speed and quality (RECOMMENDED)
# Q5_K_M - Slower, better quality
# Q6_K  - Slow, near-original quality
# Q8_0  - Slowest, best quality, largest size
# F16   - Full precision, requires 2x memory

# Example: Download Q4_K_M for best balance
ollama pull llama3:8b-instruct-q4_K_M

Maximize Metal GPU Offloading

Ensure all model layers run on the GPU for maximum performance. Partial CPU offloading significantly reduces throughput.

# llama.cpp: offload all layers to GPU
./llama-cli -m model.gguf -ngl 99

# Check GPU utilization
sudo powermetrics --samplers gpu_power -n 1 -i 1000

# Monitor memory pressure
memory_pressure
# System-wide memory free percentage: 45%

Optimize Batch Size and Context

Reducing context window size frees memory and can improve throughput. Only use as much context as your application actually needs.

# Default context is often 4096 or 8192 tokens. Ollama has no --num-ctx flag;
# set num_ctx per session in the REPL (or via the API "options" field):
ollama run llama3:8b
# >>> /set parameter num_ctx 2048

# For llama.cpp, set context (-c), prompt-processing batch size (-b),
# and concurrent request slots (--parallel):
./llama-server -m model.gguf -ngl 99 \
  -c 2048 \
  -b 512 \
  --parallel 2

Keep Models Hot in Memory

Loading a model from disk takes several seconds. Keep frequently-used models resident in memory for instant responses.

# Ollama: set keep-alive to keep model in memory indefinitely
curl http://localhost:11434/api/generate -d '{
  "model": "llama3:8b",
  "keep_alive": -1
}'

# Or set environment variable for default behavior
export OLLAMA_KEEP_ALIVE=-1

# Check which models are loaded
ollama ps

10. Frequently Asked Questions

Can I run ChatGPT-level models on a Mac Mini M4?

Yes. Models like Llama 3 8B and Mistral 7B deliver quality comparable to GPT-3.5 for many tasks. For GPT-4-level quality, you would need a 70B model which requires 48GB+ of unified memory (Mac Mini M4 Pro). The experience is excellent for coding assistance, document Q&A, and content generation.

Is 35 tokens/second fast enough for real-time chat?

Absolutely. Average human reading speed is about 4-5 words per second, which translates to roughly 5-7 tokens per second. At 35 tok/s, the model generates text 5-7x faster than a human can read it. For chat applications, this feels completely instant.

How many concurrent users can a Mac Mini M4 handle?

With a 7B model, a single Mac Mini M4 can handle 2-4 concurrent requests with acceptable latency. For higher concurrency, you can run multiple Mac Minis behind a load balancer. The Ollama and llama.cpp servers both support concurrent request queuing.
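
To see how your own box behaves under parallel load, a quick thread-pool sketch against the OpenAI-compatible endpoint works well (the worker count and prompt are arbitrary; Ollama queues requests beyond its configured parallelism):

# pip install openai
import time
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

def one_request(i: int) -> float:
    start = time.time()
    client.chat.completions.create(
        model="llama3:8b",
        messages=[{"role": "user", "content": f"Give me one interesting fact about networking (#{i})."}],
        max_tokens=128,
    )
    return time.time() - start

# Fire 4 requests at once and report per-request latency
with ThreadPoolExecutor(max_workers=4) as pool:
    latencies = list(pool.map(one_request, range(4)))

print([f"{t:.1f}s" for t in latencies])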

Can I fine-tune models on Mac Mini M4?

Yes, with limitations. You can fine-tune 7B models using LoRA/QLoRA on 16GB devices using MLX or the Hugging Face PEFT library. Full fine-tuning of larger models requires more memory. For production fine-tuning of 70B+ models, GPU servers are more practical.

Which framework should I choose: Ollama, llama.cpp, or MLX?

Choose Ollama if you want the fastest setup and easy API serving. Choose llama.cpp for maximum control over inference parameters and the best GGUF model performance. Choose MLX if you are building Python ML pipelines and want native Apple Silicon optimization. Many users start with Ollama and move to llama.cpp or MLX as their needs grow.

Start Running LLMs on Apple Silicon

Get a dedicated Mac Mini M4 server and run Llama, Mistral, or Phi with unlimited inference. From $75/mo with a 7-day free trial.