Local Inference at the Edge: Gemma on Mac Minis

Deploying quantized models across a LAN cluster with Tailscale mesh, shadow-mode calibration, and the economics of cloud vs. local.

Every conversation about AI costs eventually reduces to the same question: can you run it locally?

For the kind of work I'm doing with Strata — where the Match Agent needs to deep-score dozens of job listings per day, and the Scraper Agent needs to OCR career pages full of non-standard formatting — the cloud API costs are manageable but not negligible. More importantly, they scale linearly: more listings to evaluate means proportionally more spend. That's a bad cost curve for an autonomous system.

So I deployed two Mac Minis running local model inference via Apple Silicon. Here's what worked, what didn't, and what I learned about the real tradeoffs.

The infrastructure

Machine	Spec	Primary Role
mac-mini-1	M4, 16GB RAM, 512GB SSD	Strata services + OCR (GLM-OCR-2B via MLX)
mac-mini-2	M4, 16GB RAM, 1TB SSD	LLM inference (Gemma 4 E4B, 4-bit quantized via MLX)

Both machines are connected via direct Ethernet on the same LAN. Each machine also runs a Tailscale node, so when I'm at a coffee shop or library, my laptop can reach them through an encrypted WireGuard mesh. No port forwarding, no dynamic DNS, no exposed services.

Services run as launchd daemons — macOS's native service manager. They start on boot, restart on crash, and log to standard system locations. No Docker, no Kubernetes. These are two machines running three services. The infrastructure should be proportional to the complexity.

Model selection: why Gemma, why quantized

I evaluated several local model options. The constraints were real: with only 16GB of unified memory, the model had to be lean enough to leave headroom for the OS and services, while still achieving acceptable quality on job-description scoring. It had to support efficient inference on Apple Silicon via MLX.

Gemma 4 E4B (4-bit quantized) hit the sweet spot. At 4-bit quantization via mlx-community/gemma-4-E4B-it-4bit, the model runs with a peak memory footprint of just 5.25 GB — roughly a third of available unified memory, leaving enough for the OS, services, and comfortable operation. Performance is genuinely impressive on M4:

Metric	Measured
End-to-end latency (over LAN)	0.93 seconds
Prompt throughput	66.8 tokens/sec
Generation throughput	66.1 tokens/sec
Peak memory	5.25 GB
Model	`mlx-community/gemma-4-E4B-it-4bit`

Sub-second latency over LAN and 66+ tokens/sec generation speed — more than adequate for scoring tasks, and fast enough for interactive use if needed. With bigger models (27B+), you'd want more memory. Like many people building on Apple Silicon right now, I'm eagerly watching the M5 Studio announcements — 192GB of unified memory would open up entirely different model classes. But the point of this exercise is getting creative and being productive with what's already on the desk, not waiting for ideal hardware.

For OCR, GLM-OCR-2B via MLX runs on the first Mac Mini. It handles parsing career pages, extracting text from images, and processing PDF resumes. At 2B parameters, it's fast and accurate for structured text extraction.

Shadow-mode calibration

Here's the part most teams skip. You can't just deploy a local model and hope it's "good enough." You need data proving it.

Shadow mode means running both the cloud model (GPT-4o-mini or Claude Haiku) and the local model (Gemma) on the same input simultaneously, then comparing results. You keep using the cloud model for production scoring while silently collecting local model outputs for evaluation.

The metrics we track:

Metric	What It Measures	Target
Pearson R correlation	Numerical score agreement between cloud and local	≥ 0.85
Routing-tier agreement	Same match tier (Strong/Medium/Weak/No) assigned	≥ 90%
Latency per evaluation	Time-to-score for a single listing	≤ 30s
Failure rate	Malformed or empty responses	≤ 2%

When the local model consistently meets these thresholds on a large enough sample (100+ evaluations), you can start routing production traffic to it. The routing is gradual: start with "easy" matches (high-confidence Strong or No Match), then expand the routing as confidence grows.

The economic argument

The numbers are straightforward:

Scenario	Monthly Cost	Notes
All cloud (current)	$7-10/mo in API costs	GPT-4o-mini + Haiku + Sonnet, with 80/15/5 routing
Hybrid (target)	$2-4/mo in API costs	Local handles 80%+ of scoring; cloud for hard cases
Electricity for two Mac Minis	~$8-10/mo	Idle ~5-7W each; inference ~25-35W each

The Mac Minis were a one-time capital expense. At current API prices, the payback is measured in months if you're running any serious volume. But the real value isn't the dollar savings — it's the cost curve. Cloud costs scale linearly with volume. Local costs are fixed. When you're building an autonomous system that should run 24/7 without human attention, fixed costs are structurally better.

The honest counter-argument: API costs are dropping fast. OpenAI, Anthropic, and Google are in a price war. Running local inference today might not make economic sense in 12 months. What local inference gives you that APIs can't: data stays on your network (privacy), zero rate limits, no vendor dependency, and the ability to fine-tune. Whether those matter depends on your use case.

Lessons learned

MLX is the secret weapon for Apple Silicon. Apple's MLX framework makes running quantized models on M-series chips genuinely practical. The unified memory architecture means no GPU memory management, no CUDA dependencies. pip install mlx-lm, download the quantized weights, and you're running inference.

Tailscale makes home infrastructure viable. Before Tailscale, running services on home machines meant either opening ports (security risk) or setting up a VPN server (maintenance burden). Tailscale gives you a zero-config, encrypted mesh network. Every machine gets a stable hostname. It just works.

launchd for services is underrated. The containerization instinct is strong, but macOS's native service management is excellent for this scale. launchd plists handle restarts, logging, environment variables, and startup dependencies. Two machines, three services. No orchestration layer needed.

Calibration before cutover, always. The temptation is to deploy the local model and immediately start using it. Don't. Shadow mode exists because models have failure modes that aren't obvious in synthetic benchmarks. Real job descriptions have weird formatting, inconsistent structures, and domain-specific jargon. You need real-world evaluation data.

Infrastructure complexity should be proportional. Two Mac Minis connected by Ethernet cable, running three Python services as launchd daemons, with Tailscale for remote access. That's the entire infrastructure. It would take an afternoon to set up from scratch. If your ML infrastructure requires a platform team to maintain, it's probably over-engineered for your actual needs.