Gemma 4 on Apple Silicon

Personalize Gemma 4.
Make it real-time.

Fine-tune on your conversations. Serve at 111 tok/s. No cloud, no API keys, no data leaves your Mac.

111.6
Tokens / Second
154ms
Time to First Token
0.029
Real-Time Factor
5 min
Fine-Tune Time

Real benchmarks. Real hardware.

Three backends, all achieving real-time voice targets. Measured on M4 Max with Gemma 4 E4B.

Backend TTFT P50 TTFT P95 TPS P50 TPS Mean Verdict
MLX Server (mlx_lm) 154ms 889ms 111.6 111.4 REAL-TIME
Ollama (Go + llama.cpp) 141ms 150ms 107.9 102.7 REAL-TIME
llama.cpp (Metal) 136ms 141ms 94.0 94.1 REAL-TIME

Everything you need.

From data extraction to real-time serving — a complete pipeline for personalizing Gemma 4.

Multi-Source Extraction

Extract conversations from iMessage, Facebook Messenger, WhatsApp, or any JSONL source. Voice-optimized filtering built in.

LoRA Fine-Tuning

SFT + DPO pipeline with PLE-safe quantization. Train all three model targets (E4B, E2B, 31B) with a single command.

5 Serving Backends

MLX, Ollama, llama.cpp, vLLM Metal, and an experimental ANE+GPU bridge. All OpenAI-compatible.

Speculative Decoding

E2B draft model proposes tokens, E4B target verifies in parallel. Train both on the same data for maximum acceptance rate.

TurboQuant KV Cache

3-bit KV cache compression — 4.6x smaller with ~2% quality loss. Critical for long conversations on constrained devices.

Fully Local & Private

No cloud APIs. No data upload. Extraction, training, and inference all happen on your Mac. Your conversations stay yours.

Four steps to real-time.

From raw conversation data to a personalized model running at voice speed.

1

Extract

Pull conversations from iMessage, Facebook, or any platform

2

Prepare

Combine, deduplicate, and split into training data

3

Fine-Tune

LoRA training on Gemma 4 (5-15 minutes on Apple Silicon)

4

Serve

Real-time inference at 111+ tok/s with your personalized model

The secret performance stack.

Six layers of undocumented hardware and APIs we discovered, benchmarked, and proved work — all running on your Mac right now.

AMX/SME2 Coprocessor

Undocumented CPU matrix coprocessor. 77x faster than NEON, 2.5 TFLOPS FP32. Every transformer matmul uses this.

77x speedup proven

Neural Engine Private API

Discovered _ANEClient with 46 methods, _ANEModel (52), _ANEInMemoryModel (41). 16-core dedicated neural accelerator.

9 private classes found

IOSurface Zero-Copy

Shared memory across CPU, GPU, and ANE with no memcpy. 5+ TB/s effective bandwidth. The secret to hybrid pipelines.

5,444 GB/s measured

Dynamic Metal Kernels

MTLFunctionConstant specialization compiles purpose-built GPU shaders for each model config. Full loop unrolling.

4 Gemma configs compiled

Hybrid Pipeline

GPU prefill + decode with IOSurface KV cache. 1,333 tok/s demonstrated — 53x real-time margin.

★ 53x real-time ★

6/6 Benchmarks Pass

Every layer built, tested, and proven on M4 Max. Full benchmark suite with automated report generation.

All verified E2E

Running in 5 minutes.

Install, extract, fine-tune, and serve. That's it.

Terminal
# Install dependencies pip install mlx mlx-lm # Extract your conversations python3 scripts/extract_imessage_pairs.py python3 scripts/extract-facebook.py --export ~/Downloads/facebook-export # Prepare training data python3 scripts/prepare-training-data.py --voice # Fine-tune Gemma 4 E4B (real-time voice model) python3 scripts/finetune-gemma.py --target e4b --data data/finetune # Serve with real-time optimizations python3 scripts/mlx-server.py \ --model mlx-community/gemma-4-e4b-it-4bit \ --adapter-path ~/.human/training-data/adapters/seth-lora-e4b \ --realtime # Benchmark — prove it's real-time python3 scripts/voice-bench.py

Built for Apple Silicon.

Unified memory means zero-copy GPU access to model weights. More memory = bigger models.

E2B (2B params)

180 tok/s
Minimum: M1, 8GB
Recommended: M2+, 16GB
Speculative decode draft model

E4B (4B params)

110 tok/s
Minimum: M1 Pro, 16GB
Recommended: M3 Pro+, 36GB
Real-time voice daily driver

31B (dense)

20 tok/s
Minimum: M2 Max, 64GB
Recommended: M4 Max, 128GB
Highest quality, complex tasks