Fine-tune on your conversations. Serve at 111 tok/s. No cloud, no API keys, no data leaves your Mac.
Three backends, all achieving real-time voice targets. Measured on M4 Max with Gemma 4 E4B.
| Backend | TTFT P50 | TTFT P95 | TPS P50 | TPS Mean | Verdict |
|---|---|---|---|---|---|
| MLX Server (mlx_lm) | 154ms | 889ms | 111.6 | 111.4 | REAL-TIME |
| Ollama (Go + llama.cpp) | 141ms | 150ms | 107.9 | 102.7 | REAL-TIME |
| llama.cpp (Metal) | 136ms | 141ms | 94.0 | 94.1 | REAL-TIME |
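The stats in the table above can be computed from streamed token arrival timestamps. A minimal, backend-agnostic sketch (no server needed; the percentile method is a simple nearest-rank approximation, an assumption rather than the project's actual harness):

```python
import statistics

def latency_stats(token_times):
    """Compute TTFT and decode TPS from per-request token arrival
    timestamps (seconds, relative to each request's start)."""
    ttfts, tps = [], []
    for times in token_times:
        ttfts.append(times[0])  # time to first token
        if len(times) > 1:
            # tokens per second over the decode phase (after the first token)
            tps.append((len(times) - 1) / (times[-1] - times[0]))
    def pctl(xs, p):
        xs = sorted(xs)
        return xs[min(len(xs) - 1, int(p / 100 * len(xs)))]
    return {
        "ttft_p50": pctl(ttfts, 50),
        "ttft_p95": pctl(ttfts, 95),
        "tps_p50": pctl(tps, 50),
        "tps_mean": statistics.mean(tps),
    }

# Three synthetic requests: first token at 150 ms, then ~100 tok/s decode.
runs = [[0.15 + i * 0.01 for i in range(50)] for _ in range(3)]
stats = latency_stats(runs)
```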
From data extraction to real-time serving — a complete pipeline for personalizing Gemma 4.
Extract conversations from iMessage, Facebook Messenger, WhatsApp, or any JSONL source. Voice-optimized filtering built in.
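A voice-optimized filter favors short, spoken-style exchanges. A minimal sketch over JSONL rows; the `messages`/`content` field names and the word-count threshold are assumptions, not the extractor's actual schema:

```python
import json

def voice_filter(lines, max_words=60):
    """Keep short, conversational exchanges; drop turns that read like
    pasted text (links, very long messages)."""
    kept = []
    for line in lines:
        conv = json.loads(line)
        turns = conv.get("messages", [])
        if not turns:
            continue
        if any("http" in t["content"] for t in turns):
            continue  # pasted links rarely make good voice training data
        if all(len(t["content"].split()) <= max_words for t in turns):
            kept.append(conv)
    return kept
```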
SFT + DPO pipeline with PLE-safe quantization. Train all three model targets (E4B, E2B, 31B) with a single command.
MLX, Ollama, llama.cpp, vLLM Metal, and an experimental ANE+GPU bridge. All OpenAI-compatible.
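OpenAI compatibility means one request shape works against every backend; only the base URL changes. A sketch of the shared payload (the ports and the fine-tuned model name are illustrative assumptions, not project defaults):

```python
import json

# The same OpenAI-style chat payload works against any of the backends.
BACKENDS = {
    "mlx": "http://localhost:8080/v1/chat/completions",
    "ollama": "http://localhost:11434/v1/chat/completions",
    "llama.cpp": "http://localhost:8081/v1/chat/completions",
}

payload = {
    "model": "gemma-e4b-personal",  # hypothetical fine-tuned model name
    "stream": True,                 # stream tokens for low perceived latency
    "messages": [{"role": "user", "content": "Hey, what's up?"}],
}
body = json.dumps(payload)
```

Any OpenAI-compatible client can then POST `body` to the chosen URL.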
E2B draft model proposes tokens, E4B target verifies in parallel. Train both on the same data for maximum acceptance rate.
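The draft/verify loop can be sketched with toy greedy models, where a proposed token is accepted when the target would have emitted the same token. This is an illustrative simplification (real verification scores all k positions in one batched forward pass):

```python
def speculative_step(draft_next, target_next, seq, k=4):
    """One round of greedy speculative decoding (toy sketch).
    draft_next/target_next map a token sequence to its next token."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposal, s = [], list(seq)
    for _ in range(k):
        t = draft_next(s)
        proposal.append(t)
        s.append(t)
    # 2. Target verifies each proposed position; stop at first mismatch.
    accepted, s = [], list(seq)
    for t in proposal:
        if target_next(s) == t:
            accepted.append(t)
            s.append(t)
        else:
            break
    # 3. On mismatch, keep the target's own token for that position,
    #    so every round still makes progress.
    if len(accepted) < k:
        accepted.append(target_next(s))
    return seq + accepted

# Toy models: the target repeats a fixed pattern; the draft usually agrees.
pattern = [1, 2, 3, 4]
target = lambda s: pattern[len(s) % 4]
draft = lambda s: pattern[len(s) % 4] if len(s) % 7 else 9
```

Training draft and target on the same data pushes their distributions together, which is exactly what raises the acceptance rate per round.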
3-bit KV cache compression — 4.6x smaller with ~2% quality loss. Critical for long conversations on constrained devices.
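The 4.6x figure follows from group-wise storage arithmetic: per 32-value group, 3 bits per value plus one fp16 scale, versus 16 bits per value. A minimal sketch (the group size and signed code range are assumptions about the scheme, not its documented layout):

```python
def quantize_3bit(values, group=32):
    """Group-wise 3-bit quantization sketch: each group stores one
    fp16 scale plus a 3-bit signed code (-4..3) per value."""
    blocks = []
    for i in range(0, len(values), group):
        g = values[i:i + group]
        scale = (max(abs(v) for v in g) / 3) or 1.0  # map +/-max onto +/-3
        q = [max(-4, min(3, round(v / scale))) for v in g]
        blocks.append((scale, q))
    return blocks

def dequantize_3bit(blocks):
    return [q * scale for scale, q_list in blocks for q in q_list]

# Storage per 32-value group: 32*3 bits of codes + one 16-bit scale,
# versus 32*16 bits for fp16. 512/112 = ~4.6x smaller.
ratio = (32 * 16) / (32 * 3 + 16)
```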
No cloud APIs. No data upload. Extraction, training, and inference all happen on your Mac. Your conversations stay yours.
From raw conversation data to a personalized model running at voice speed.
1. Extract: pull conversations from iMessage, Facebook, or any platform
2. Prepare: combine, deduplicate, and split into training data
3. Fine-tune: LoRA training on Gemma 4 (5-15 minutes on Apple Silicon)
4. Serve: real-time inference at 111+ tok/s with your personalized model
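The combine/deduplicate/split step above can be sketched in a few lines; hashing the serialized conversation is an assumed dedup key, and the 10% validation split is an illustrative default:

```python
import hashlib
import json
import random

def dedupe_and_split(conversations, val_frac=0.1, seed=0):
    """Combine, deduplicate (by content hash), and split into train/val."""
    seen, unique = set(), []
    for conv in conversations:
        key = hashlib.sha256(
            json.dumps(conv, sort_keys=True).encode()
        ).hexdigest()
        if key not in seen:  # drop exact duplicates across sources
            seen.add(key)
            unique.append(conv)
    random.Random(seed).shuffle(unique)  # deterministic, reproducible split
    n_val = max(1, int(len(unique) * val_frac))
    return unique[n_val:], unique[:n_val]
```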
Six layers of undocumented hardware and APIs that we discovered, benchmarked, and proved to work — all running on your Mac right now.
Undocumented CPU matrix coprocessor. 77x faster than NEON, 2.5 TFLOPS FP32. Every transformer matmul uses this.
Discovered _ANEClient with 46 methods, _ANEModel (52), _ANEInMemoryModel (41). 16-core dedicated neural accelerator.
Shared memory across CPU, GPU, and ANE with no memcpy. 5+ TB/s effective bandwidth. The secret to hybrid pipelines.
MTLFunctionConstant specialization compiles purpose-built GPU shaders for each model config. Full loop unrolling.
GPU prefill + decode with IOSurface KV cache. 1,333 tok/s demonstrated — 53x real-time margin.
Every layer built, tested, and proven on M4 Max. Full benchmark suite with automated report generation.
Install, extract, fine-tune, and serve. That's it.
Unified memory means zero-copy GPU access to model weights. More memory = bigger models.
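"More memory = bigger models" is simple arithmetic: weight memory is roughly parameters times bits per weight, divided by 8. A sketch (the ~10% overhead factor for KV cache and runtime buffers is an assumption):

```python
def model_memory_gb(params_b, bits, overhead=1.1):
    """Rough weight-memory estimate for a model with `params_b` billion
    parameters at `bits` bits per weight, plus ~10% runtime overhead."""
    return params_b * 1e9 * bits / 8 * overhead / 2**30

# e.g. an 8B model: ~4 GB at 4-bit, ~16 GB at fp16.
```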