Every conversation about AI costs eventually reduces to the same question: can you run it locally?
For the kind of work I'm doing with Strata — where the Match Agent needs to deep-score dozens of job listings per day, and the Scraper Agent needs to OCR career pages full of non-standard formatting — the cloud API costs are manageable but not negligible. More importantly, they scale linearly: more listings to evaluate means proportionally more spend. That's a bad cost curve for an autonomous system.
So I deployed two Mac Minis running local model inference on Apple Silicon. Here's what worked, what didn't, and what I learned about the real tradeoffs.
## The infrastructure
| Machine | Spec | Primary Role |
|---|---|---|
| mac-mini-1 | M4, 16GB RAM, 512GB SSD | Strata services + OCR (GLM-OCR-2B via MLX) |
| mac-mini-2 | M4, 16GB RAM, 1TB SSD | LLM inference (Gemma 4 E4B, 4-bit quantized via MLX) |
Both machines are connected via direct Ethernet on the same LAN. Each machine also runs a Tailscale node, so when I'm at a coffee shop or library, my laptop can reach them through an encrypted WireGuard mesh. No port forwarding, no dynamic DNS, no exposed services.
Services run as launchd daemons — macOS's native service manager. They start on boot, restart on crash, and log to standard system locations. No Docker, no Kubernetes. These are two machines running three services. The infrastructure should be proportional to the complexity.
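As a sketch, a launchd plist for one of these services might look like the following. The label, binary paths, and log locations are hypothetical, but the keys (`RunAtLoad`, `KeepAlive`, the log paths) are the standard launchd mechanisms for boot-start, crash-restart, and logging:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <!-- Hypothetical label; reverse-DNS naming is the launchd convention -->
    <key>Label</key>
    <string>com.example.strata.inference</string>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/local/bin/python3</string>
        <string>/opt/strata/inference_server.py</string>
    </array>
    <!-- Start on boot, restart on crash -->
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
    <!-- Log to standard system locations -->
    <key>StandardOutPath</key>
    <string>/var/log/strata/inference.log</string>
    <key>StandardErrorPath</key>
    <string>/var/log/strata/inference.err</string>
</dict>
</plist>
```

Drop it in `/Library/LaunchDaemons/` and load it with `launchctl`, and the service survives reboots and crashes with no further machinery.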
## Model selection: why Gemma, why quantized
I evaluated several local model options. The constraints were real: with only 16GB of unified memory, the model had to be lean enough to leave headroom for the OS and services, while still achieving acceptable quality on job-description scoring. It had to support efficient inference on Apple Silicon via MLX.
Gemma 4 E4B (4-bit quantized) hit the sweet spot. At 4-bit quantization via mlx-community/gemma-4-E4B-it-4bit, the model runs with a peak memory footprint of just 5.25 GB — roughly a third of available unified memory, leaving enough for the OS, services, and comfortable operation. Performance is genuinely impressive on M4:
| Metric | Measured |
|---|---|
| End-to-end latency (over LAN) | 0.93 seconds |
| Prompt throughput | 66.8 tokens/sec |
| Generation throughput | 66.1 tokens/sec |
| Peak memory | 5.25 GB |
| Model | mlx-community/gemma-4-E4B-it-4bit |
Sub-second latency over LAN and 66+ tokens/sec generation speed — more than adequate for scoring tasks, and fast enough for interactive use if needed. With bigger models (27B+), you'd want more memory. Like many people building on Apple Silicon right now, I'm eagerly watching the M5 Studio announcements — 192GB of unified memory would open up entirely different model classes. But the point of this exercise is getting creative and being productive with what's already on the desk, not waiting for ideal hardware.
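To make the scoring path concrete, here's a minimal sketch of how the Match Agent might prompt the local model and parse its answer. The prompt wording, the JSON schema, and `generate_fn` are illustrative, not the production implementation; in practice `generate_fn` would wrap `mlx_lm.generate` against the quantized Gemma weights loaded once at service start.

```python
import json

TIERS = {"Strong", "Medium", "Weak", "No"}  # match tiers used in calibration

def score_listing(listing_text: str, profile: str, generate_fn) -> dict:
    """Ask the local model for a match score and tier, returned as JSON.

    generate_fn is any callable (prompt -> completion string); in production
    it would wrap mlx_lm.generate with the 4-bit Gemma model.
    """
    prompt = (
        "You are scoring a job listing against a candidate profile.\n"
        f"Profile:\n{profile}\n\nListing:\n{listing_text}\n\n"
        'Reply with JSON only: {"score": <0-100>, "tier": "Strong|Medium|Weak|No"}'
    )
    raw = generate_fn(prompt)
    # A malformed reply raises here -- that's what the failure-rate metric counts
    result = json.loads(raw)
    if result.get("tier") not in TIERS:
        raise ValueError(f"unexpected tier: {result.get('tier')}")
    return result

# Usage with a canned response standing in for the model:
fake_model = lambda prompt: '{"score": 82, "tier": "Strong"}'
print(score_listing("Senior Python role...", "Backend engineer...", fake_model))
```

Keeping the model behind a plain callable also makes shadow-mode comparison trivial: the same function runs against the cloud backend or the local one.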
For OCR, GLM-OCR-2B via MLX runs on the first Mac Mini. It handles parsing career pages, extracting text from images, and processing PDF resumes. At 2B parameters, it's fast and accurate for structured text extraction.
## Shadow-mode calibration
Here's the part most teams skip. You can't just deploy a local model and hope it's "good enough." You need data proving it. In shadow mode, the local model scores every listing alongside the cloud model, but only the cloud result drives decisions; the local output is logged purely for comparison.
The metrics I track:
| Metric | What It Measures | Target |
|---|---|---|
| Pearson R correlation | Numerical score agreement between cloud and local | ≥ 0.85 |
| Routing-tier agreement | Same match tier (Strong/Medium/Weak/No) assigned | ≥ 90% |
| Latency per evaluation | Time-to-score for a single listing | ≤ 30s |
| Failure rate | Malformed or empty responses | ≤ 2% |
When the local model consistently meets these thresholds on a large enough sample (100+ evaluations), you can start routing production traffic to it. The routing is gradual: start with "easy" matches (high-confidence Strong or No Match), then expand the routing as confidence grows.
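The comparison itself is simple arithmetic. A sketch of the shadow-mode check, assuming paired cloud/local scores and tiers collected over the calibration window (thresholds taken from the table above):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between cloud and local numerical scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def tier_agreement(cloud_tiers, local_tiers):
    """Fraction of evaluations where both models assigned the same tier."""
    same = sum(c == l for c, l in zip(cloud_tiers, local_tiers))
    return same / len(cloud_tiers)

def ready_to_route(scores_cloud, scores_local, tiers_cloud, tiers_local,
                   min_samples=100, r_target=0.85, agree_target=0.90):
    """Apply the calibration thresholds before any production cutover."""
    if len(scores_cloud) < min_samples:
        return False
    return (pearson_r(scores_cloud, scores_local) >= r_target
            and tier_agreement(tiers_cloud, tiers_local) >= agree_target)
```

Nothing here is exotic; the point is that the cutover decision is a computed gate, not a gut call.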
## The economic argument
The numbers are straightforward:
| Scenario | Monthly Cost | Notes |
|---|---|---|
| All cloud (current) | $7-10/mo in API costs | GPT-4o-mini + Haiku + Sonnet, with 80/15/5 routing |
| Hybrid (target) | $2-4/mo in API costs | Local handles 80%+ of scoring; cloud for hard cases |
| Electricity for two Mac Minis | ~$8-10/mo | Idle ~5-7W each; inference ~25-35W each |
The Mac Minis were a one-time capital expense. At current API prices, the payback is measured in months if you're running any serious volume. But the real value isn't the dollar savings — it's the cost curve. Cloud costs scale linearly with volume. Local costs are fixed. When you're building an autonomous system that should run 24/7 without human attention, fixed costs are structurally better.
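The cost-curve point can be made explicit with back-of-envelope arithmetic. The per-evaluation price and the 80% local fraction below are illustrative assumptions, not measured billing data:

```python
def monthly_cost_cloud(evals_per_day, cost_per_eval=0.004):
    """All-cloud spend scales linearly with volume. cost_per_eval is illustrative."""
    return evals_per_day * 30 * cost_per_eval

def monthly_cost_hybrid(evals_per_day, electricity=9.0, cloud_fraction=0.2,
                        cost_per_eval=0.004):
    """Hybrid: fixed electricity plus a small cloud slice for hard cases."""
    return electricity + evals_per_day * 30 * cloud_fraction * cost_per_eval

for volume in (50, 200, 1000):  # evaluations per day
    print(volume, monthly_cost_cloud(volume), monthly_cost_hybrid(volume))
```

At low volume the all-cloud line is cheaper, but its slope is five times steeper; as an autonomous system scales up its evaluations, the fixed-cost setup wins and keeps winning.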
## Lessons learned
MLX is the secret weapon for Apple Silicon. Apple's MLX framework makes running quantized models on M-series chips genuinely practical. The unified memory architecture means no GPU memory management, no CUDA dependencies. `pip install mlx-lm`, download the quantized weights, and you're running inference.
Tailscale makes home infrastructure viable. Before Tailscale, running services on home machines meant either opening ports (security risk) or setting up a VPN server (maintenance burden). Tailscale gives you a zero-config, encrypted mesh network. Every machine gets a stable hostname. It just works.
launchd for services is underrated. The containerization instinct is strong, but macOS's native service management is excellent for this scale. launchd plists handle restarts, logging, environment variables, and startup dependencies. Two machines, three services. No orchestration layer needed.
Calibration before cutover, always. The temptation is to deploy the local model and immediately start using it. Don't. Shadow mode exists because models have failure modes that aren't obvious in synthetic benchmarks. Real job descriptions have weird formatting, inconsistent structures, and domain-specific jargon. You need real-world evaluation data.
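The gradual cutover described earlier reduces to a small routing rule: trust the local model only on high-confidence extremes at first, then widen the local tiers as calibration data accumulates. Tier names match the calibration table; the confidence field and thresholds are illustrative:

```python
def route(local_tier: str, local_confidence: float,
          local_tiers=("Strong", "No"), confidence_floor=0.9) -> str:
    """Return 'local' when the local verdict is an easy, high-confidence
    extreme; otherwise fall back to the cloud model."""
    if local_tier in local_tiers and local_confidence >= confidence_floor:
        return "local"
    return "cloud"

print(route("Strong", 0.95))  # easy, confident extreme -> local
print(route("Medium", 0.95))  # ambiguous middle tier   -> cloud
print(route("No", 0.60))      # low confidence          -> cloud
```

Expanding the rollout is then just widening `local_tiers` or lowering `confidence_floor`, each change justified by fresh shadow-mode numbers.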
Infrastructure complexity should be proportional. Two Mac Minis connected by Ethernet cable, running three Python services as launchd daemons, with Tailscale for remote access. That's the entire infrastructure. It would take an afternoon to set up from scratch. If your ML infrastructure requires a platform team to maintain, it's probably over-engineered for your actual needs.