1. Why Build a Private AI Server?

Sending sensitive data to third-party AI APIs like OpenAI, Anthropic, or Google introduces risks that many organizations cannot accept. A private AI server keeps every byte of data under your control -- on hardware you own or rent exclusively -- with zero external API calls.

🔒

Data Sovereignty

Your prompts, documents, and model outputs never leave your infrastructure. No third-party logging, no training on your data, no risk of data leaks through API endpoints you do not control.

📋

Regulatory Compliance

Meet GDPR, HIPAA, SOC 2, and industry-specific compliance requirements by keeping AI processing within your data boundary. No cross-border data transfer concerns.

💰

Cost Predictability

Fixed monthly cost regardless of usage. No per-token billing surprises, no sudden rate limit changes, no pricing increases from API providers. Run unlimited inference 24/7 at a flat rate.

⚡

No Rate Limits

Process as many requests as your hardware can handle. No tokens-per-minute caps, no request queuing on the provider side, no degraded performance during peak hours.

The Risk of Cloud APIs: When you send data to a cloud AI provider, you lose control. Even with contractual guarantees, your data traverses networks you don't manage, resides on servers you don't control, and is subject to the provider's security posture. For regulated industries -- healthcare, finance, legal, defense -- this is often a non-starter.

Why Mac Mini M4? Apple Silicon's unified memory architecture lets you run 7B-70B parameter models on a single device that consumes under 15W of power. The M4's memory bandwidth (up to 120 GB/s) feeds model weights to the GPU efficiently, delivering 30-40 tokens/second for 7B models -- fast enough for real-time chat. Starting at $75/mo for dedicated hardware, it is the most cost-effective way to build a private AI server.

2. Architecture Overview

The private AI server stack consists of five layers, each running locally on your Mac Mini M4. No external services are required.

System Architecture

Clients

Web Browser / API Consumers / Mobile Apps

nginx Reverse Proxy

SSL/TLS Termination + Rate Limiting + Auth

Open WebUI

Chat Interface (Port 3000)

RAG Pipeline

FastAPI + LangChain (Port 8000)

Ollama

LLM Inference Engine (Port 11434) -- Llama 3, Mistral, CodeLlama

ChromaDB

Vector Database (Port 8200)

Mac Mini M4

Apple Silicon + Unified Memory

# Port mapping summary for the full stack:
#
# Port 443   - nginx (HTTPS, public-facing)
# Port 80    - nginx (HTTP, redirects to HTTPS)
# Port 11434 - Ollama (LLM inference, internal only)
# Port 3000  - Open WebUI (chat interface, proxied via nginx)
# Port 8000  - RAG API (FastAPI, proxied via nginx)
# Port 8200  - ChromaDB (vector database, internal only)
# Port 51820 - WireGuard VPN (optional, for remote access)

3. Step 1: Server Setup

Start by provisioning your Mac Mini M4 and configuring secure remote access. If you are using My Remote Mac, SSH access is provided out of the box.

SSH Configuration

# Connect to your Mac Mini M4 via SSH
ssh admin@your-mac-mini.myremotemac.com

# Generate an SSH key pair (if you don't have one)
ssh-keygen -t ed25519 -C "ai-server-admin" -f ~/.ssh/id_ai_server

# Copy your public key to the server
ssh-copy-id -i ~/.ssh/id_ai_server.pub admin@your-mac-mini.myremotemac.com

# Configure SSH client for easier access
cat <<EOF >> ~/.ssh/config
Host ai-server
    HostName your-mac-mini.myremotemac.com
    User admin
    IdentityFile ~/.ssh/id_ai_server
    ForwardAgent no
    ServerAliveInterval 60
    ServerAliveCountMax 3
EOF

# Now connect with just:
ssh ai-server

System Updates & Preparation

# Update macOS to latest version
sudo softwareupdate --install --all --agree-to-license

# Install Homebrew (package manager)
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install essential tools
brew install wget curl jq htop python@3.12 git

# Create a dedicated directory for AI services
mkdir -p ~/ai-server/{models,data,logs,config}
mkdir -p ~/ai-server/rag/{documents,vectorstore}

# Set up Python virtual environment for AI tools
python3.12 -m venv ~/ai-server/venv
source ~/ai-server/venv/bin/activate
pip install --upgrade pip setuptools wheel

Create a Dedicated Service User (Optional)

# Create a dedicated user for AI services (principle of least privilege)
sudo dscl . -create /Users/aiservice
sudo dscl . -create /Users/aiservice UserShell /bin/zsh
sudo dscl . -create /Users/aiservice RealName "AI Service Account"
sudo dscl . -create /Users/aiservice UniqueID 550
sudo dscl . -create /Users/aiservice PrimaryGroupID 20
sudo dscl . -create /Users/aiservice NFSHomeDirectory /Users/aiservice
sudo mkdir -p /Users/aiservice
sudo chown aiservice:staff /Users/aiservice

# Grant access to the AI server directory
sudo chown -R aiservice:staff ~/ai-server

4. Step 2: Install Ollama & Models

Ollama is the backbone of your private AI server. It manages model downloads, quantization, and provides an OpenAI-compatible API -- all running locally with zero external calls.

Install Ollama

# Download and install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version
# ollama version 0.5.4

# Start the Ollama server (runs on localhost:11434 by default)
ollama serve &

Pull Models for Different Use Cases

# General-purpose assistant (recommended starting point)
ollama pull llama3:8b              # 4.7 GB - fits 16GB RAM

# Instruction-following and reasoning
ollama pull mistral:7b             # 4.1 GB - excellent for RAG

# Code generation and analysis
ollama pull codellama:7b           # 3.8 GB - code-specific model
ollama pull codellama:13b          # 7.4 GB - better code quality (needs 24GB)

# Embedding model for RAG pipeline
ollama pull nomic-embed-text       # 274 MB - text embeddings

# Verify all models are downloaded
ollama list
# NAME                    SIZE      MODIFIED
# llama3:8b               4.7 GB    2 minutes ago
# mistral:7b              4.1 GB    5 minutes ago
# codellama:7b            3.8 GB    8 minutes ago
# nomic-embed-text        274 MB    10 minutes ago

# Test a model interactively
ollama run llama3:8b "What are the benefits of self-hosted AI?"

Configure Ollama as a Persistent Service

# Create a launchd plist for Ollama to start on boot
cat <<'EOF' > ~/Library/LaunchAgents/com.ollama.server.plist
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.ollama.server</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/ollama</string>
        <string>serve</string>
    </array>
    <key>EnvironmentVariables</key>
    <dict>
        <key>OLLAMA_HOST</key>
        <string>127.0.0.1:11434</string>
        <key>OLLAMA_KEEP_ALIVE</key>
        <string>-1</string>
        <key>OLLAMA_NUM_PARALLEL</key>
        <string>4</string>
    </dict>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <key>StandardOutPath</key>
    <string>/Users/admin/ai-server/logs/ollama.log</string>
    <key>StandardErrorPath</key>
    <string>/Users/admin/ai-server/logs/ollama-error.log</string>
</dict>
</plist>
EOF

# Load the service
launchctl load ~/Library/LaunchAgents/com.ollama.server.plist

# Verify Ollama is running
curl -s http://localhost:11434/api/tags | jq '.models[].name'
# "llama3:8b"
# "mistral:7b"
# "codellama:7b"
# "nomic-embed-text"

Important: Notice that OLLAMA_HOST is set to 127.0.0.1:11434 (localhost only). This ensures Ollama is not directly accessible from the network. All external access will go through the nginx reverse proxy configured in Step 5.

5. Step 3: Build a RAG Pipeline

Retrieval-Augmented Generation (RAG) lets your AI answer questions based on your own documents -- company wikis, legal contracts, technical documentation -- without sending any of that data to a cloud provider. Here we build a complete RAG pipeline with ChromaDB and LangChain.

Install Dependencies

# Activate the virtual environment
source ~/ai-server/venv/bin/activate

# Install RAG pipeline dependencies
pip install \
    langchain==0.1.20 \
    langchain-community==0.0.38 \
    langchain-chroma==0.1.0 \
    chromadb==0.4.24 \
    sentence-transformers==2.7.0 \
    pypdf==4.2.0 \
    docx2txt==0.8 \
    fastapi==0.111.0 \
    uvicorn==0.29.0 \
    python-multipart==0.0.9 \
    pydantic==2.7.1

Document Ingestion Pipeline

# ~/ai-server/rag/ingest.py
"""
Document ingestion pipeline for the private RAG system.
Loads PDFs, DOCX, and text files, splits them into chunks,
generates embeddings via Ollama, and stores them in ChromaDB.
"""
import os
import sys
from pathlib import Path
from langchain_community.document_loaders import (
    PyPDFLoader,
    Docx2txtLoader,
    TextLoader,
    DirectoryLoader,
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_chroma import Chroma

# Configuration
DOCUMENTS_DIR = os.path.expanduser("~/ai-server/rag/documents")
VECTORSTORE_DIR = os.path.expanduser("~/ai-server/rag/vectorstore")
OLLAMA_BASE_URL = "http://localhost:11434"
EMBEDDING_MODEL = "nomic-embed-text"
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200

def load_documents(directory: str):
    """Load all supported document types from a directory."""
    documents = []
    path = Path(directory)

    # Load PDFs
    for pdf_file in path.glob("**/*.pdf"):
        loader = PyPDFLoader(str(pdf_file))
        documents.extend(loader.load())
        print(f"  Loaded: {pdf_file.name} ({len(loader.load())} pages)")

    # Load DOCX files
    for docx_file in path.glob("**/*.docx"):
        loader = Docx2txtLoader(str(docx_file))
        documents.extend(loader.load())
        print(f"  Loaded: {docx_file.name}")

    # Load text files
    for txt_file in path.glob("**/*.txt"):
        loader = TextLoader(str(txt_file))
        documents.extend(loader.load())
        print(f"  Loaded: {txt_file.name}")

    # Load markdown files
    for md_file in path.glob("**/*.md"):
        loader = TextLoader(str(md_file))
        documents.extend(loader.load())
        print(f"  Loaded: {md_file.name}")

    return documents

def create_vectorstore(documents):
    """Split documents into chunks and store embeddings in ChromaDB."""
    # Split documents into chunks
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=CHUNK_SIZE,
        chunk_overlap=CHUNK_OVERLAP,
        length_function=len,
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = text_splitter.split_documents(documents)
    print(f"\nSplit {len(documents)} documents into {len(chunks)} chunks")

    # Create embeddings using Ollama (runs locally!)
    embeddings = OllamaEmbeddings(
        model=EMBEDDING_MODEL,
        base_url=OLLAMA_BASE_URL,
    )

    # Store in ChromaDB
    vectorstore = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=VECTORSTORE_DIR,
        collection_name="private_docs",
    )
    print(f"Stored {len(chunks)} chunks in ChromaDB at {VECTORSTORE_DIR}")
    return vectorstore

if __name__ == "__main__":
    print("=== Private RAG Document Ingestion ===\n")
    print(f"Loading documents from: {DOCUMENTS_DIR}")
    docs = load_documents(DOCUMENTS_DIR)
    print(f"\nTotal documents loaded: {len(docs)}")

    if not docs:
        print("No documents found. Add files to ~/ai-server/rag/documents/")
        sys.exit(1)

    print("\nCreating vector store...")
    create_vectorstore(docs)
    print("\nIngestion complete!")

RAG Query API

# ~/ai-server/rag/api.py
"""
RAG Query API - FastAPI service for document Q&A.
Retrieves relevant chunks from ChromaDB and generates
answers using Ollama. Everything runs locally.
"""
from fastapi import FastAPI, UploadFile, File, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from pydantic import BaseModel
from typing import Optional, List
import os
import shutil

from langchain_community.embeddings import OllamaEmbeddings
from langchain_chroma import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Configuration
VECTORSTORE_DIR = os.path.expanduser("~/ai-server/rag/vectorstore")
DOCUMENTS_DIR = os.path.expanduser("~/ai-server/rag/documents")
OLLAMA_BASE_URL = "http://localhost:11434"
EMBEDDING_MODEL = "nomic-embed-text"
LLM_MODEL = "mistral:7b"

app = FastAPI(
    title="Private RAG API",
    description="Self-hosted document Q&A with zero cloud dependencies",
    version="1.0.0",
)

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_methods=["*"],
    allow_headers=["*"],
)

# Initialize components
embeddings = OllamaEmbeddings(
    model=EMBEDDING_MODEL,
    base_url=OLLAMA_BASE_URL,
)

vectorstore = Chroma(
    persist_directory=VECTORSTORE_DIR,
    embedding_function=embeddings,
    collection_name="private_docs",
)

llm = Ollama(
    model=LLM_MODEL,
    base_url=OLLAMA_BASE_URL,
    temperature=0.3,
    num_ctx=4096,
)

# Custom prompt template
PROMPT_TEMPLATE = """Use the following context to answer the question.
If you cannot find the answer in the context, say "I don't have enough
information in the provided documents to answer this question."

Context:
{context}

Question: {question}

Answer:"""

prompt = PromptTemplate(
    template=PROMPT_TEMPLATE,
    input_variables=["context", "question"],
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(
        search_type="similarity",
        search_kwargs={"k": 4},
    ),
    chain_type_kwargs={"prompt": prompt},
    return_source_documents=True,
)


class QueryRequest(BaseModel):
    question: str
    model: Optional[str] = "mistral:7b"
    num_results: Optional[int] = 4


class QueryResponse(BaseModel):
    answer: str
    sources: List[dict]
    model: str


@app.post("/query", response_model=QueryResponse)
async def query_documents(request: QueryRequest):
    """Query your private documents using RAG."""
    try:
        result = qa_chain.invoke({"query": request.question})
        sources = [
            {
                "content": doc.page_content[:200] + "...",
                "source": doc.metadata.get("source", "unknown"),
                "page": doc.metadata.get("page", None),
            }
            for doc in result.get("source_documents", [])
        ]
        return QueryResponse(
            answer=result["result"],
            sources=sources,
            model=request.model,
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/upload")
async def upload_document(file: UploadFile = File(...)):
    """Upload a document for ingestion into the RAG pipeline."""
    allowed_types = [".pdf", ".docx", ".txt", ".md"]
    ext = os.path.splitext(file.filename)[1].lower()

    if ext not in allowed_types:
        raise HTTPException(
            status_code=400,
            detail=f"Unsupported file type. Allowed: {allowed_types}",
        )

    file_path = os.path.join(DOCUMENTS_DIR, file.filename)
    with open(file_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)

    return {"message": f"Uploaded {file.filename}", "path": file_path}


@app.get("/health")
async def health_check():
    """Health check endpoint."""
    return {"status": "healthy", "model": LLM_MODEL, "vectorstore": "chromadb"}


# Run: uvicorn api:app --host 127.0.0.1 --port 8000 --workers 2

Run the RAG Pipeline

# Step 1: Add your documents
cp /path/to/your/documents/*.pdf ~/ai-server/rag/documents/
cp /path/to/your/documents/*.docx ~/ai-server/rag/documents/

# Step 2: Run ingestion
cd ~/ai-server/rag
python ingest.py
# === Private RAG Document Ingestion ===
# Loading documents from: /Users/admin/ai-server/rag/documents
#   Loaded: company-handbook.pdf (45 pages)
#   Loaded: api-documentation.md
#   Loaded: compliance-policy.docx
# Total documents loaded: 47
# Split 47 documents into 312 chunks
# Stored 312 chunks in ChromaDB

# Step 3: Start the RAG API server
uvicorn api:app --host 127.0.0.1 --port 8000 --workers 2 &

# Step 4: Test with a query
curl -s http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "What is our company vacation policy?"}' | jq .
# {
#   "answer": "According to the company handbook, employees receive...",
#   "sources": [...],
#   "model": "mistral:7b"
# }

6. Step 4: Deploy Open WebUI

Open WebUI provides a polished ChatGPT-like interface for interacting with your local models. It connects directly to Ollama and supports conversations, image generation, and document uploads -- all running privately on your Mac Mini.

Install Docker

# Install Docker Desktop for Mac (Apple Silicon native)
brew install --cask docker

# Start Docker Desktop
open -a Docker

# Verify Docker is running
docker --version
# Docker version 26.1.0, build 9714adc
docker compose version
# Docker Compose version v2.27.0

Docker Compose for Open WebUI

# ~/ai-server/docker-compose.yml
version: '3.8'

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: unless-stopped
    ports:
      - "127.0.0.1:3000:8080"
    environment:
      # Connect to Ollama running on the host
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
      # Disable telemetry and external connections
      - ENABLE_SIGNUP=false
      - ENABLE_COMMUNITY_SHARING=false
      - WEBUI_AUTH=true
      - WEBUI_SECRET_KEY=your-strong-secret-key-change-this
      # Data privacy settings
      - ENABLE_OPENAI_API=false
      - ENABLE_OLLAMA_API=true
      - SAFE_MODE=true
    volumes:
      - open-webui-data:/app/backend/data
    extra_hosts:
      - "host.docker.internal:host-gateway"

  # Optional: ChromaDB as a persistent service
  chromadb:
    image: chromadb/chroma:latest
    container_name: chromadb
    restart: unless-stopped
    ports:
      - "127.0.0.1:8200:8000"
    volumes:
      - chromadb-data:/chroma/chroma
    environment:
      - IS_PERSISTENT=TRUE
      - ANONYMIZED_TELEMETRY=FALSE

volumes:
  open-webui-data:
  chromadb-data:

Launch and Configure

# Start the services
cd ~/ai-server
docker compose up -d

# Check that containers are running
docker compose ps
# NAME          STATUS          PORTS
# open-webui    Up 2 minutes    127.0.0.1:3000->8080/tcp
# chromadb      Up 2 minutes    127.0.0.1:8200->8000/tcp

# View logs
docker compose logs -f open-webui

# Open WebUI is now accessible at http://localhost:3000
# On first visit, create an admin account (this account is local only)

# Verify Ollama connectivity from Open WebUI
curl -s http://localhost:3000/api/config | jq '.ollama'

# To update Open WebUI later:
docker compose pull && docker compose up -d

Privacy Note: Open WebUI is configured with ENABLE_OPENAI_API=false and ENABLE_COMMUNITY_SHARING=false to ensure no data is sent to external services. The ENABLE_SIGNUP=false setting prevents unauthorized users from creating accounts. All conversations are stored locally in the Docker volume.

7. Step 5: API Gateway with nginx

nginx serves as the single entry point for all services. It handles SSL/TLS termination, rate limiting, authentication, and routes traffic to Ollama, Open WebUI, and the RAG API.

Install and Configure nginx

# Install nginx
brew install nginx

# Generate a self-signed SSL certificate (or use Let's Encrypt)
mkdir -p ~/ai-server/config/ssl
openssl req -x509 -nodes -days 365 -newkey rsa:2048 \
    -keyout ~/ai-server/config/ssl/server.key \
    -out ~/ai-server/config/ssl/server.crt \
    -subj "/CN=ai-server.local/O=Private AI/C=US"

nginx Configuration

# /opt/homebrew/etc/nginx/nginx.conf

worker_processes auto;
error_log /Users/admin/ai-server/logs/nginx-error.log;

events {
    worker_connections 1024;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    # Logging
    access_log /Users/admin/ai-server/logs/nginx-access.log;

    # Rate limiting zones
    limit_req_zone $binary_remote_addr zone=api:10m rate=30r/m;
    limit_req_zone $binary_remote_addr zone=chat:10m rate=60r/m;
    limit_req_zone $binary_remote_addr zone=upload:10m rate=5r/m;

    # Connection limits
    limit_conn_zone $binary_remote_addr zone=addr:10m;

    # SSL settings
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 10m;

    # Security headers
    add_header X-Frame-Options DENY always;
    add_header X-Content-Type-Options nosniff always;
    add_header X-XSS-Protection "1; mode=block" always;
    add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
    add_header Content-Security-Policy "default-src 'self'" always;

    # Redirect HTTP to HTTPS
    server {
        listen 80;
        server_name ai-server.local;
        return 301 https://$host$request_uri;
    }

    # Main HTTPS server
    server {
        listen 443 ssl;
        server_name ai-server.local;

        ssl_certificate     /Users/admin/ai-server/config/ssl/server.crt;
        ssl_certificate_key /Users/admin/ai-server/config/ssl/server.key;

        # Client body size limit (for document uploads)
        client_max_body_size 50M;

        # Open WebUI (chat interface)
        location / {
            limit_req zone=chat burst=20 nodelay;
            limit_conn addr 10;

            proxy_pass http://127.0.0.1:3000;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # WebSocket support for streaming
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_read_timeout 300s;
        }

        # Ollama API (for programmatic access)
        location /ollama/ {
            limit_req zone=api burst=10 nodelay;
            limit_conn addr 5;

            # Basic auth for API access
            auth_basic "Private AI API";
            auth_basic_user_file /Users/admin/ai-server/config/.htpasswd;

            rewrite ^/ollama/(.*) /$1 break;
            proxy_pass http://127.0.0.1:11434;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_read_timeout 300s;
        }

        # RAG API
        location /rag/ {
            limit_req zone=api burst=10 nodelay;
            limit_conn addr 5;

            auth_basic "Private AI API";
            auth_basic_user_file /Users/admin/ai-server/config/.htpasswd;

            rewrite ^/rag/(.*) /$1 break;
            proxy_pass http://127.0.0.1:8000;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_read_timeout 120s;
        }

        # Document upload endpoint
        location /rag/upload {
            limit_req zone=upload burst=3 nodelay;

            auth_basic "Private AI API";
            auth_basic_user_file /Users/admin/ai-server/config/.htpasswd;

            rewrite ^/rag/(.*) /$1 break;
            proxy_pass http://127.0.0.1:8000;
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
        }

        # Health check (no auth required)
        location /health {
            proxy_pass http://127.0.0.1:8000/health;
        }

        # Deny access to hidden files
        location ~ /\. {
            deny all;
        }
    }
}

Set Up Authentication and Start

# Install htpasswd utility
brew install httpd

# Create API credentials
htpasswd -c ~/ai-server/config/.htpasswd api-user
# Enter a strong password when prompted

# Test nginx configuration
nginx -t
# nginx: configuration file /opt/homebrew/etc/nginx/nginx.conf test is successful

# Start nginx
brew services start nginx

# Test HTTPS access
curl -k https://localhost/health
# {"status": "healthy", "model": "mistral:7b", "vectorstore": "chromadb"}

# Test API access with authentication
curl -k -u api-user:your-password \
    https://localhost/ollama/api/tags | jq '.models[].name'

# Test RAG query through nginx
curl -k -u api-user:your-password \
    https://localhost/rag/query \
    -H "Content-Type: application/json" \
    -d '{"question": "What is our data retention policy?"}'

8. Security Hardening

A private AI server is only as secure as its weakest entry point. This section covers firewall configuration, SSH hardening, VPN access, and intrusion detection to create a defense-in-depth security posture.

macOS Firewall with pf

# ~/ai-server/config/pf.rules
#
# Packet Filter rules for private AI server
# Only allow SSH, HTTPS, and WireGuard VPN from outside

# Define macros
ext_if = "en0"
vpn_if = "utun1"

# Default: block everything
block all

# Allow loopback traffic
pass quick on lo0 all

# Allow established connections
pass in quick on $ext_if proto tcp from any to any flags A/A

# Allow SSH (port 22) - restrict to known IPs if possible
pass in on $ext_if proto tcp from any to any port 22

# Allow HTTPS (port 443) through nginx
pass in on $ext_if proto tcp from any to any port 443

# Allow HTTP (port 80) for redirect to HTTPS
pass in on $ext_if proto tcp from any to any port 80

# Allow WireGuard VPN (port 51820)
pass in on $ext_if proto udp from any to any port 51820

# Allow all traffic on VPN interface
pass on $vpn_if all

# Allow all outbound traffic
pass out on $ext_if all

# Block everything else inbound (implicit from "block all")
# Internal services (11434, 3000, 8000, 8200) are NOT exposed

# --- Load these rules ---
# sudo pfctl -f ~/ai-server/config/pf.rules
# sudo pfctl -e   # Enable pf
# sudo pfctl -sr  # Show active rules

SSH Key-Only Authentication

# Harden SSH configuration
# Edit /etc/ssh/sshd_config (requires sudo)

# Disable password authentication (key-only)
PasswordAuthentication no
ChallengeResponseAuthentication no
UsePAM no

# Disable root login
PermitRootLogin no

# Only allow specific users
AllowUsers admin

# Use strong key exchange algorithms
KexAlgorithms curve25519-sha256,curve25519-sha256@libssh.org
HostKeyAlgorithms ssh-ed25519,rsa-sha2-512,rsa-sha2-256
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com

# Reduce login grace time and max attempts
LoginGraceTime 30
MaxAuthTries 3
MaxSessions 5

# Disable unused features
X11Forwarding no
AllowTcpForwarding no
AllowAgentForwarding no

# Restart SSH
# sudo launchctl stop com.openssh.sshd
# sudo launchctl start com.openssh.sshd

WireGuard VPN for Secure Remote Access

# Install WireGuard
brew install wireguard-tools

# Generate server keys
wg genkey | tee ~/ai-server/config/wg-server-private.key | \
    wg pubkey > ~/ai-server/config/wg-server-public.key

# Generate client keys
wg genkey | tee ~/ai-server/config/wg-client-private.key | \
    wg pubkey > ~/ai-server/config/wg-client-public.key

# Server configuration
cat <<EOF > ~/ai-server/config/wg0.conf
[Interface]
PrivateKey = $(cat ~/ai-server/config/wg-server-private.key)
Address = 10.66.66.1/24
ListenPort = 51820
PostUp = echo "WireGuard started"
PostDown = echo "WireGuard stopped"

[Peer]
# Client 1 - Your workstation
PublicKey = $(cat ~/ai-server/config/wg-client-public.key)
AllowedIPs = 10.66.66.2/32
PersistentKeepalive = 25
EOF

# Client configuration (copy to your workstation)
cat <<EOF > ~/ai-server/config/wg-client.conf
[Interface]
PrivateKey = $(cat ~/ai-server/config/wg-client-private.key)
Address = 10.66.66.2/24
DNS = 1.1.1.1

[Peer]
PublicKey = $(cat ~/ai-server/config/wg-server-public.key)
Endpoint = your-mac-mini.myremotemac.com:51820
AllowedIPs = 10.66.66.0/24
PersistentKeepalive = 25
EOF

# Start WireGuard on the server
sudo wg-quick up ~/ai-server/config/wg0.conf

# Verify connection
sudo wg show
# interface: utun1
#   public key: 
#   listening port: 51820
#
# peer: 
#   allowed ips: 10.66.66.2/32

Login Attempt Monitoring (fail2ban equivalent)

# ~/ai-server/scripts/monitor-ssh.sh
#!/bin/bash
# Simple SSH brute-force detection and blocking for macOS
# Run via cron every 5 minutes

LOG_FILE="/var/log/system.log"
BLOCK_THRESHOLD=5
BLOCK_FILE="$HOME/ai-server/config/blocked_ips.txt"

# Find IPs with failed SSH attempts in the last 10 minutes
failed_ips=$(log show --predicate 'process == "sshd" AND eventMessage CONTAINS "Failed"' \
    --last 10m 2>/dev/null | \
    grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' | \
    sort | uniq -c | sort -rn)

echo "$failed_ips" | while read count ip; do
    if [ "$count" -ge "$BLOCK_THRESHOLD" ] && [ -n "$ip" ]; then
        # Check if already blocked
        if ! grep -q "$ip" "$BLOCK_FILE" 2>/dev/null; then
            echo "$(date): Blocking $ip ($count failed attempts)" >> ~/ai-server/logs/security.log
            echo "$ip" >> "$BLOCK_FILE"

            # Add pf block rule
            echo "block in quick from $ip to any" | sudo pfctl -f - -a "blocked/$ip" 2>/dev/null
        fi
    fi
done

# Make executable and add to crontab:
# chmod +x ~/ai-server/scripts/monitor-ssh.sh
# crontab -e
# */5 * * * * ~/ai-server/scripts/monitor-ssh.sh

Security Checklist:

SSH key-only authentication enabled, passwords disabled
All AI services bound to localhost only (127.0.0.1)
nginx is the only public-facing service (ports 80/443)
pf firewall blocks all non-essential inbound traffic
WireGuard VPN for secure remote administration
Rate limiting on all API endpoints
Basic auth on Ollama and RAG API endpoints
Automated brute-force detection and IP blocking
Security headers (HSTS, CSP, X-Frame-Options) on all responses

9. Cost Analysis

The most compelling argument for a private AI server is cost. Cloud AI APIs charge per token, and costs escalate rapidly at scale. A Mac Mini M4 provides unlimited inference at a fixed monthly rate.

OpenAI API vs. Private Mac Mini M4

Comparison assumes GPT-3.5-Turbo pricing ($0.50/1M input tokens, $1.50/1M output tokens) vs. a dedicated Mac Mini M4 running Llama 3 8B. Average request: 500 input tokens + 300 output tokens.

Monthly Requests	OpenAI API Cost	Mac Mini M4 Cost	Savings
1,000	$0.70	$75	-$74.30 (API cheaper)
10,000	$7.00	$75	-$68.00 (API cheaper)
100,000	$70.00	$75	~Break-even
500,000	$350.00	$75	$275 (79% savings)
1,000,000	$700.00	$75	$625 (89% savings)

GPT-4 Level Comparison

For GPT-4-level quality, compare GPT-4-Turbo ($10/1M input, $30/1M output) vs. Mac Mini M4 Pro 48GB running Llama 3 70B.

Monthly Requests	GPT-4 Turbo Cost	Mac Mini M4 Pro Cost	Savings
10,000	$140.00	$179	~Break-even
100,000	$1,400.00	$179	$1,221 (87% savings)
1,000,000	$14,000.00	$179	$13,821 (99% savings)

Beyond Cost: The real value of a private AI server is not just financial savings. It is the elimination of vendor dependency, the guarantee of data privacy, and the freedom to iterate without worrying about API billing. Your costs stay flat whether you run 100 requests or 10 million.

10. Scaling

A single Mac Mini M4 can handle 2-4 concurrent requests for a 7B model. When you need more throughput or want model specialization, scaling horizontally with multiple Mac Minis is straightforward.

Multi-Node Architecture

Node 1: General Chat

Mac Mini M4 16GB

Llama 3 8B for general-purpose conversations, customer support, and content generation.

$75/mo

Node 2: Code Assistant

Mac Mini M4 24GB

CodeLlama 13B for code generation, review, and refactoring tasks.

$95/mo

Node 3: RAG & Reasoning

Mac Mini M4 Pro 48GB

Llama 3 70B for complex document analysis, legal research, and deep reasoning tasks.

$179/mo

Load Balancing with nginx

# nginx upstream configuration for multi-node load balancing

# Define upstream groups by model type
upstream ollama_general {
    # Round-robin across general chat nodes
    server 10.66.66.10:11434;  # Node 1
    server 10.66.66.11:11434;  # Node 1 replica (if needed)
    keepalive 8;
}

upstream ollama_code {
    server 10.66.66.20:11434;  # Node 2 - Code models
    keepalive 4;
}

upstream ollama_reasoning {
    server 10.66.66.30:11434;  # Node 3 - Large models
    keepalive 4;
}

# Model routing based on request path
server {
    listen 443 ssl;
    server_name ai-cluster.local;

    # Route general chat requests
    location /v1/chat/ {
        proxy_pass http://ollama_general;
        proxy_read_timeout 300s;
    }

    # Route code generation requests
    location /v1/code/ {
        proxy_pass http://ollama_code;
        proxy_read_timeout 300s;
    }

    # Route reasoning/analysis requests
    location /v1/reasoning/ {
        proxy_pass http://ollama_reasoning;
        proxy_read_timeout 600s;
    }
}

Model Routing Script

# ~/ai-server/scripts/model_router.py
"""
Intelligent model router that directs requests to the appropriate
Mac Mini node based on the requested model and current load.
"""
from fastapi import FastAPI, Request
import httpx
import asyncio

app = FastAPI()

NODES = {
    "general": {
        "url": "http://10.66.66.10:11434",
        "models": ["llama3:8b", "mistral:7b"],
    },
    "code": {
        "url": "http://10.66.66.20:11434",
        "models": ["codellama:7b", "codellama:13b"],
    },
    "reasoning": {
        "url": "http://10.66.66.30:11434",
        "models": ["llama3:70b", "mixtral:8x7b"],
    },
}

def get_node_for_model(model: str) -> str:
    """Find which node hosts the requested model."""
    for node_name, config in NODES.items():
        if model in config["models"]:
            return config["url"]
    # Default to general node
    return NODES["general"]["url"]

@app.post("/v1/chat/completions")
async def route_chat(request: Request):
    body = await request.json()
    model = body.get("model", "llama3:8b")
    target_url = get_node_for_model(model)

    async with httpx.AsyncClient(timeout=300) as client:
        response = await client.post(
            f"{target_url}/v1/chat/completions",
            json=body,
        )
        return response.json()

@app.get("/v1/models")
async def list_all_models():
    """Aggregate model lists from all nodes."""
    all_models = []
    async with httpx.AsyncClient(timeout=10) as client:
        for node_name, config in NODES.items():
            try:
                resp = await client.get(f"{config['url']}/api/tags")
                models = resp.json().get("models", [])
                for m in models:
                    m["node"] = node_name
                all_models.extend(models)
            except Exception:
                pass
    return {"models": all_models}

# Run: uvicorn model_router:app --host 0.0.0.0 --port 8080

11. Frequently Asked Questions

Is a self-hosted AI server truly private if I rent the hardware?

Yes. When you rent a dedicated Mac Mini from My Remote Mac, you get exclusive access to the physical hardware. No other customer shares your machine. All data is stored on your server's SSD, encrypted at rest. SSH keys ensure only you have access. When you end your subscription, the drive is securely wiped. This is fundamentally different from shared cloud VMs where other tenants run on the same physical host.

Can I meet GDPR and HIPAA compliance with this setup?

A private AI server addresses the core technical requirements of GDPR (data stays within your control, no third-party processing without consent) and HIPAA (PHI is not transmitted to cloud AI providers). However, full compliance also requires organizational controls, audit logging, encryption policies, and potentially a BAA with your hosting provider. Use this setup as the technical foundation and work with your compliance team for the complete picture.

How does the quality of local models compare to GPT-4 or Claude?

For specific, well-defined tasks (document Q&A, code generation, summarization, classification), open-source models like Llama 3 8B and Mistral 7B achieve 85-95% of GPT-3.5 quality. Llama 3 70B approaches GPT-4 quality on many benchmarks. The quality gap narrows with RAG pipelines, where retrieving the right context matters more than raw model capability. For general creative writing or complex multi-step reasoning, frontier cloud models still have an edge.

What happens if the server goes down?

All services are configured with KeepAlive (Ollama via launchd) and restart: unless-stopped (Docker containers). If the Mac Mini reboots, all services restart automatically. For production workloads, consider running two Mac Minis with nginx load balancing for high availability. My Remote Mac infrastructure includes 24/7 monitoring and redundant power and network connectivity.

Can I use this setup with existing tools like Cursor, Continue.dev, or VS Code?

Absolutely. Since Ollama provides an OpenAI-compatible API, any tool that can connect to an OpenAI endpoint can use your private server. In Cursor or Continue.dev, point the API base URL to https://your-server/ollama/v1 with your basic auth credentials. VS Code extensions like Continue, Cody, and Tabby all support custom endpoints. Your code never leaves your server.

How do I update models when new versions are released?

Updating models is a single command: ollama pull llama3:8b will download the latest version. The old version is kept until you explicitly remove it with ollama rm. You can test new model versions alongside existing ones and switch in production with zero downtime by updating the model name in your API calls or nginx configuration.

What about model licensing? Can I use Llama 3 commercially?

Llama 3 is released under the Meta Llama 3 Community License which permits commercial use for organizations with under 700 million monthly active users. Mistral models are released under the Apache 2.0 license (fully permissive). CodeLlama follows the Llama 2 Community License. Always check the specific license for each model you deploy, but for the vast majority of businesses, these models are freely usable in production.

Build a Private AI Server with Mac Mini M4 No Cloud APIs Needed

Table of Contents