← Andrew Crenshaw
Build log · Updated May 2026

Local Inference at the Edge: A Four-Machine Fleet

From two Mac Minis to a four-machine Apple Silicon cluster: schedule-driven model switching, oMLX routing, and what it looks like to wire local inference into every layer of the stack.

Every conversation about running local AI eventually hits the same question: why bother?

The honest answer isn't "API costs." I've spent more on hardware than I'd spend on API costs in years of current-volume operation. That's not the story.

The story is control, privacy, and what you learn by actually running the models. For the kind of work I'm doing with Strata, where the job-scoring pipeline handles salary data, career targets, and interview coaching content, everything staying local isn't optional. Beyond that: when you manage your own inference fleet, you configure temperature, quantization, context windows, and thinking-on/thinking-off per workload. You can't do any of that through an API. And frontier lab pricing is demonstrably subsidized; tightening usage tiers are already the story. Converting that uncertainty to a fixed electricity bill is structurally sound.

So I built a four-machine cluster, three Mac Minis and a Mac Studio, running local model inference on Apple Silicon via oMLX. Here's what the setup looks like today and what I've learned from running it.

The infrastructure

MachineSpecPrimary Role
mac-studio1M3 Ultra, 96GB RAM, 28 coresPrimary oMLX host: all production inference (Strata + Alexandria); OWC storage hub (22TB SMB exports)
mac-mini3M4 Pro, 64GB RAM, 4TB internalSecondary oMLX host: Alexandria librarian + overnight synthesis; Strata failover
mac-mini1M4, 16GB RAM, 228GB internalEdge oMLX: extract-fast + embed-fast, always-on
mac-mini2M4, 16GB RAM, 228GB internalEdge oMLX mirror: round-robin load balance + failover for mini1

All four machines are connected via Ethernet on the same LAN. Each also runs a Tailscale node, giving my laptop encrypted WireGuard access from anywhere. No port forwarding, no dynamic DNS, no exposed services.

Inference runs as launchd daemons, macOS's native service manager. They start on boot, restart on crash, and log to standard locations. No Docker, no Kubernetes, no orchestration layer. The infrastructure should be proportional to the actual complexity.

The inference engine: oMLX, not Ollama or vanilla MLX

The serving runtime matters more than I expected. The fleet runs oMLX 0.3.8, a purpose-built Mac inference server that's distinct from both Apple's MLX framework and Ollama. The difference: oMLX adds DFlash (token-budget-aware attention caching for single-request throughput), TurboQuant (dynamic quantization at high context lengths), and continuous-batching profiles for multi-agent fan-out. It exposes an OpenAI-compatible API on port 8000, which means any client that speaks that protocol can route to it.

The fleet is organized into three tiers by hardware capability:

Heavy tier, mac-studio1 (M3 Ultra, 96GB, ~820 GB/s memory bandwidth) carries all production online inference. Two Qwen3.6 models resident simultaneously:

ModelSizeSlots
Qwen3.6-35B-A3B-Abliterated-MLX-8bit (MoE)~36 GBchat-interactive, score-fast, score-deep, extract-fast
Qwen3.6-27B-MLX-8bit (Dense)~27 GBsynth-deep, librarian-curate
DFlash draft weight (~0.93 GB)<1 GBRequired for DFlash-ON profile on the 35B MoE

With DFlash enabled on the 35B MoE, interactive throughput reaches 245–326 tok/s for single requests. The 27B Dense model runs with thinking enabled for synthesis work, it hits 91.9% TruthfulQA, which matters when generating interview coaching content or Alexandria knowledge bootstraps where accuracy is the constraint, not speed.

Medium tier, mac-mini3 (M4 Pro, 64GB, 4TB internal) is the dedicated Alexandria worker. One model loaded permanently:

ModelSizeProfile
Qwen3.6-27B-MLX-8bit (Dense)~27 GBThinking ON, TurboQuant ON at ≥65K context, 65K context window

mini3 runs the nightly Alexandria schedule, synthesis-detector, wiki-lint, link-builder, dedup, on its own hardware so it never competes with studio1's live traffic. With 64GB and a 4TB local drive for model storage, it handles librarian-curate passes that take an hour without impacting anything else on the fleet.

Edge tier, mac-mini1 + mac-mini2 (M4, 16GB each, mirrored config) serve Living Memory's near-real-time slots. Two models co-loaded per machine (~2.5 GB total):

ModelSizeSlot
Qwen3.6-2B-MLX-4bit~1.6 GBextract-fast: dup-check, chunk summarize, quick classification
snowflake-arctic-embed-l-v2.0-8bit~600 MBembed-fast: 1024-dim embeddings

The edge machines exist for one reason: isolation. When an agent is actively coding and a duplicate-check fires, I want a sub-3s response regardless of what studio1 is doing (which might be a long synthesis job or a Strata batch run). mini1 is primary, mini2 is immediate failover. Resolver routes extract-fast and embed-fast to the edge tier first; falls back to studio1 only if both edge machines are unreachable.

The slot layer: how applications talk to the fleet

Application code doesn't call specific models or machines, it calls slots. The role-resolver maps each slot to a (model_name, profile, host) tuple at runtime. Change the binding, not the code.

SlotHostModelUsed For
embed-fastmini1/2snowflake-arctic-embed-lEmbeddings for Living Memory retrieval
extract-fastmini1/2 → studio12B MLX-4bit → 35B MoEDup-check, chunk summarize, quick extraction
score-faststudio135B MoE + DFlash + continuous batchingStrata Stage 2A: fast scoring, ~290 tok/s aggregate
score-deepstudio135B MoE, no DFlashStrata Stage 2B: deep evaluation
chat-interactivestudio135B MoE + DFlashInteractive agent turns, 245–326 tok/s
synth-deepstudio1 / mini327B Dense, thinking ONCoach synthesis, Alexandria knowledge bootstrap
librarian-curatemini3 (nightly)27B Dense, thinking ON, 65K ctxWiki lint, link-builder, dedup, staleness detection

One oMLX quirk worth knowing: 0.3.8 supports one profile per loaded model name. To run the same base model under two different profiles (e.g., 27B Dense with thinking ON for synthesis vs. thinking OFF for fast scoring), you load it under two distinct names using symlinks. Same weights on disk, two registered model IDs, two independent profile configs. It doesn't double memory usage as long as only one is active at a time.

Alexandria's knowledge operations, session priming, daily digests, knowledge queries, route to mini3 and studio1 via the same slot abstraction. When mini3 is running a long librarian pass, studio1's live-traffic capacity is untouched. Retrieval happens locally against SQLite/LanceDB on the calling machine; the LLM only runs for synthesis steps that actually need it.

Shadow-mode calibration

Here's the part most teams skip. You can't just deploy a local model and hope it's "good enough." You need data proving it.

Shadow mode means running both the cloud model and the local model on the same input simultaneously, then comparing results. You keep using the cloud model for production scoring while silently collecting local model outputs for evaluation. The local model earns production traffic by demonstrating agreement on real data, not synthetic benchmarks.

The metrics we track:

MetricWhat It MeasuresTarget
Pearson R correlationNumerical score agreement between cloud and local≥ 0.85
Routing-tier agreementSame match tier (Strong/Medium/Weak/No) assigned≥ 90%
Latency per evaluationTime-to-score for a single listing≤ 30s
Failure rateMalformed or empty responses≤ 2%

When the local model consistently meets these thresholds on a large enough sample (100+ evaluations), you can start routing production traffic to it. The routing is gradual: start with "easy" matches (high-confidence Strong or No Match), then expand the routing as confidence grows.

← All writing