Local Inference at the Edge: A Four-Machine Fleet
From two Mac Minis to a four-machine Apple Silicon cluster: schedule-driven model switching, oMLX routing, and what it looks like to wire local inference into every layer of the stack.
Every conversation about running local AI eventually hits the same question: why bother?
The honest answer isn't "API costs." I've spent more on hardware than I'd spend on API costs in years of current-volume operation. That's not the story.
The story is control, privacy, and what you learn by actually running the models. For the kind of work I'm doing with Strata, where the job-scoring pipeline handles salary data, career targets, and interview coaching content, everything staying local isn't optional. Beyond that: when you manage your own inference fleet, you configure temperature, quantization, context windows, and thinking-on/thinking-off per workload. You can't do any of that through an API. And frontier lab pricing is demonstrably subsidized; tightening usage tiers are already the story. Converting that uncertainty to a fixed electricity bill is structurally sound.
So I built a four-machine cluster, three Mac Minis and a Mac Studio, running local model inference on Apple Silicon via oMLX. Here's what the setup looks like today and what I've learned from running it.
The infrastructure
| Machine | Spec | Primary Role |
|---|---|---|
| mac-studio1 | M3 Ultra, 96GB RAM, 28 cores | Primary oMLX host: all production inference (Strata + Alexandria); OWC storage hub (22TB SMB exports) |
| mac-mini3 | M4 Pro, 64GB RAM, 4TB internal | Secondary oMLX host: Alexandria librarian + overnight synthesis; Strata failover |
| mac-mini1 | M4, 16GB RAM, 228GB internal | Edge oMLX: extract-fast + embed-fast, always-on |
| mac-mini2 | M4, 16GB RAM, 228GB internal | Edge oMLX mirror: round-robin load balance + failover for mini1 |
All four machines are connected via Ethernet on the same LAN. Each also runs a Tailscale node, giving my laptop encrypted WireGuard access from anywhere. No port forwarding, no dynamic DNS, no exposed services.
Inference runs as launchd daemons, macOS's native service manager. They start on boot, restart on crash, and log to standard locations. No Docker, no Kubernetes, no orchestration layer. The infrastructure should be proportional to the actual complexity.
The inference engine: oMLX, not Ollama or vanilla MLX
The serving runtime matters more than I expected. The fleet runs oMLX 0.3.8, a purpose-built Mac inference server that's distinct from both Apple's MLX framework and Ollama. The difference: oMLX adds DFlash (token-budget-aware attention caching for single-request throughput), TurboQuant (dynamic quantization at high context lengths), and continuous-batching profiles for multi-agent fan-out. It exposes an OpenAI-compatible API on port 8000, which means any client that speaks that protocol can route to it.
The fleet is organized into three tiers by hardware capability:
Heavy tier, mac-studio1 (M3 Ultra, 96GB, ~820 GB/s memory bandwidth) carries all production online inference. Two Qwen3.6 models resident simultaneously:
| Model | Size | Slots |
|---|---|---|
Qwen3.6-35B-A3B-Abliterated-MLX-8bit (MoE) | ~36 GB | chat-interactive, score-fast, score-deep, extract-fast |
Qwen3.6-27B-MLX-8bit (Dense) | ~27 GB | synth-deep, librarian-curate |
| DFlash draft weight (~0.93 GB) | <1 GB | Required for DFlash-ON profile on the 35B MoE |
With DFlash enabled on the 35B MoE, interactive throughput reaches 245–326 tok/s for single requests. The 27B Dense model runs with thinking enabled for synthesis work, it hits 91.9% TruthfulQA, which matters when generating interview coaching content or Alexandria knowledge bootstraps where accuracy is the constraint, not speed.
Medium tier, mac-mini3 (M4 Pro, 64GB, 4TB internal) is the dedicated Alexandria worker. One model loaded permanently:
| Model | Size | Profile |
|---|---|---|
Qwen3.6-27B-MLX-8bit (Dense) | ~27 GB | Thinking ON, TurboQuant ON at ≥65K context, 65K context window |
mini3 runs the nightly Alexandria schedule, synthesis-detector, wiki-lint, link-builder, dedup, on its own hardware so it never competes with studio1's live traffic. With 64GB and a 4TB local drive for model storage, it handles librarian-curate passes that take an hour without impacting anything else on the fleet.
Edge tier, mac-mini1 + mac-mini2 (M4, 16GB each, mirrored config) serve Living Memory's near-real-time slots. Two models co-loaded per machine (~2.5 GB total):
| Model | Size | Slot |
|---|---|---|
Qwen3.6-2B-MLX-4bit | ~1.6 GB | extract-fast: dup-check, chunk summarize, quick classification |
snowflake-arctic-embed-l-v2.0-8bit | ~600 MB | embed-fast: 1024-dim embeddings |
The edge machines exist for one reason: isolation. When an agent is actively coding and a duplicate-check fires, I want a sub-3s response regardless of what studio1 is doing (which might be a long synthesis job or a Strata batch run). mini1 is primary, mini2 is immediate failover. Resolver routes extract-fast and embed-fast to the edge tier first; falls back to studio1 only if both edge machines are unreachable.
The slot layer: how applications talk to the fleet
Application code doesn't call specific models or machines, it calls slots. The role-resolver maps each slot to a (model_name, profile, host) tuple at runtime. Change the binding, not the code.
| Slot | Host | Model | Used For |
|---|---|---|---|
embed-fast | mini1/2 | snowflake-arctic-embed-l | Embeddings for Living Memory retrieval |
extract-fast | mini1/2 → studio1 | 2B MLX-4bit → 35B MoE | Dup-check, chunk summarize, quick extraction |
score-fast | studio1 | 35B MoE + DFlash + continuous batching | Strata Stage 2A: fast scoring, ~290 tok/s aggregate |
score-deep | studio1 | 35B MoE, no DFlash | Strata Stage 2B: deep evaluation |
chat-interactive | studio1 | 35B MoE + DFlash | Interactive agent turns, 245–326 tok/s |
synth-deep | studio1 / mini3 | 27B Dense, thinking ON | Coach synthesis, Alexandria knowledge bootstrap |
librarian-curate | mini3 (nightly) | 27B Dense, thinking ON, 65K ctx | Wiki lint, link-builder, dedup, staleness detection |
One oMLX quirk worth knowing: 0.3.8 supports one profile per loaded model name. To run the same base model under two different profiles (e.g., 27B Dense with thinking ON for synthesis vs. thinking OFF for fast scoring), you load it under two distinct names using symlinks. Same weights on disk, two registered model IDs, two independent profile configs. It doesn't double memory usage as long as only one is active at a time.
Alexandria's knowledge operations, session priming, daily digests, knowledge queries, route to mini3 and studio1 via the same slot abstraction. When mini3 is running a long librarian pass, studio1's live-traffic capacity is untouched. Retrieval happens locally against SQLite/LanceDB on the calling machine; the LLM only runs for synthesis steps that actually need it.
Shadow-mode calibration
Here's the part most teams skip. You can't just deploy a local model and hope it's "good enough." You need data proving it.
The metrics we track:
| Metric | What It Measures | Target |
|---|---|---|
| Pearson R correlation | Numerical score agreement between cloud and local | ≥ 0.85 |
| Routing-tier agreement | Same match tier (Strong/Medium/Weak/No) assigned | ≥ 90% |
| Latency per evaluation | Time-to-score for a single listing | ≤ 30s |
| Failure rate | Malformed or empty responses | ≤ 2% |
When the local model consistently meets these thresholds on a large enough sample (100+ evaluations), you can start routing production traffic to it. The routing is gradual: start with "easy" matches (high-confidence Strong or No Match), then expand the routing as confidence grows.
← All writing