Fine-tune on your conversations. Serve at 111 tok/s. No cloud, no API keys, no data leaves your Mac.
Three backends, all achieving real-time voice targets. Measured on M4 Max with Gemma 4 E4B.
| Backend | TTFT P50 | TTFT P95 | TPS P50 | TPS Mean | Verdict |
|---|---|---|---|---|---|
| MLX Server (mlx_lm) | 154ms | 889ms | 111.6 | 111.4 | REAL-TIME |
| Ollama (Go + llama.cpp) | 141ms | 150ms | 107.9 | 102.7 | REAL-TIME |
| llama.cpp (Metal) | 136ms | 141ms | 94.0 | 94.1 | REAL-TIME |
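The stats in the table above can be computed from streamed token arrival timestamps. A minimal, backend-agnostic sketch (no server needed; the percentile method is a simple nearest-rank approximation, an assumption rather than the project's actual harness):

```python
import statistics

def latency_stats(token_times):
    """Compute TTFT and decode TPS from per-request token arrival
    timestamps (seconds, relative to each request's start)."""
    ttfts, tps = [], []
    for times in token_times:
        ttfts.append(times[0])  # time to first token
        if len(times) > 1:
            # tokens per second over the decode phase (after the first token)
            tps.append((len(times) - 1) / (times[-1] - times[0]))
    def pctl(xs, p):
        xs = sorted(xs)
        return xs[min(len(xs) - 1, int(p / 100 * len(xs)))]
    return {
        "ttft_p50": pctl(ttfts, 50),
        "ttft_p95": pctl(ttfts, 95),
        "tps_p50": pctl(tps, 50),
        "tps_mean": statistics.mean(tps),
    }

# Three synthetic requests: first token at 150 ms, then ~100 tok/s decode.
runs = [[0.15 + i * 0.01 for i in range(50)] for _ in range(3)]
stats = latency_stats(runs)
```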
From data extraction to real-time serving — a complete pipeline for personalizing Gemma 4.
Extract conversations from iMessage, Facebook Messenger, WhatsApp, or any JSONL source. Voice-optimized filtering built in.
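A voice-optimized filter favors short, spoken-style exchanges. A minimal sketch over JSONL rows; the `messages`/`content` field names and the word-count threshold are assumptions, not the extractor's actual schema:

```python
import json

def voice_filter(lines, max_words=60):
    """Keep short, conversational exchanges; drop turns that read like
    pasted text (links, very long messages)."""
    kept = []
    for line in lines:
        conv = json.loads(line)
        turns = conv.get("messages", [])
        if not turns:
            continue
        if any("http" in t["content"] for t in turns):
            continue  # pasted links rarely make good voice training data
        if all(len(t["content"].split()) <= max_words for t in turns):
            kept.append(conv)
    return kept
```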
SFT + DPO pipeline with PLE-safe quantization. Train all three model targets (E4B, E2B, 31B) with a single command.
MLX, Ollama, llama.cpp, vLLM Metal, and an experimental ANE+GPU bridge. All OpenAI-compatible.
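OpenAI compatibility means one request shape works against every backend; only the base URL changes. A sketch of the shared payload (the ports and the fine-tuned model name are illustrative assumptions, not project defaults):

```python
import json

# The same OpenAI-style chat payload works against any of the backends.
BACKENDS = {
    "mlx": "http://localhost:8080/v1/chat/completions",
    "ollama": "http://localhost:11434/v1/chat/completions",
    "llama.cpp": "http://localhost:8081/v1/chat/completions",
}

payload = {
    "model": "gemma-e4b-personal",  # hypothetical fine-tuned model name
    "stream": True,                 # stream tokens for low perceived latency
    "messages": [{"role": "user", "content": "Hey, what's up?"}],
}
body = json.dumps(payload)
```

Any OpenAI-compatible client can then POST `body` to the chosen URL.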
E2B draft model proposes tokens, E4B target verifies in parallel. Train both on the same data for maximum acceptance rate.
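The draft/verify loop can be sketched with toy greedy models, where a proposed token is accepted when the target would have emitted the same token. This is an illustrative simplification (real verification scores all k positions in one batched forward pass):

```python
def speculative_step(draft_next, target_next, seq, k=4):
    """One round of greedy speculative decoding (toy sketch).
    draft_next/target_next map a token sequence to its next token."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    proposal, s = [], list(seq)
    for _ in range(k):
        t = draft_next(s)
        proposal.append(t)
        s.append(t)
    # 2. Target verifies each proposed position; stop at first mismatch.
    accepted, s = [], list(seq)
    for t in proposal:
        if target_next(s) == t:
            accepted.append(t)
            s.append(t)
        else:
            break
    # 3. On mismatch, keep the target's own token for that position,
    #    so every round still makes progress.
    if len(accepted) < k:
        accepted.append(target_next(s))
    return seq + accepted

# Toy models: the target repeats a fixed pattern; the draft usually agrees.
pattern = [1, 2, 3, 4]
target = lambda s: pattern[len(s) % 4]
draft = lambda s: pattern[len(s) % 4] if len(s) % 7 else 9
```

Training draft and target on the same data pushes their distributions together, which is exactly what raises the acceptance rate per round.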
3-bit KV cache compression — 4.6x smaller with ~2% quality loss. Critical for long conversations on constrained devices.
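The 4.6x figure follows from group-wise storage arithmetic: per 32-value group, 3 bits per value plus one fp16 scale, versus 16 bits per value. A minimal sketch (the group size and signed code range are assumptions about the scheme, not its documented layout):

```python
def quantize_3bit(values, group=32):
    """Group-wise 3-bit quantization sketch: each group stores one
    fp16 scale plus a 3-bit signed code (-4..3) per value."""
    blocks = []
    for i in range(0, len(values), group):
        g = values[i:i + group]
        scale = (max(abs(v) for v in g) / 3) or 1.0  # map +/-max onto +/-3
        q = [max(-4, min(3, round(v / scale))) for v in g]
        blocks.append((scale, q))
    return blocks

def dequantize_3bit(blocks):
    return [q * scale for scale, q_list in blocks for q in q_list]

# Storage per 32-value group: 32*3 bits of codes + one 16-bit scale,
# versus 32*16 bits for fp16. 512/112 = ~4.6x smaller.
ratio = (32 * 16) / (32 * 3 + 16)
```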
No cloud APIs. No data upload. Extraction, training, and inference all happen on your Mac. Your conversations stay yours.
From raw conversation data to a personalized model running at voice speed.
1. Extract: pull conversations from iMessage, Facebook, or any platform
2. Prepare: combine, deduplicate, and split into training data
3. Fine-tune: LoRA training on Gemma 4 (5-15 minutes on Apple Silicon)
4. Serve: real-time inference at 111+ tok/s with your personalized model
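The combine/deduplicate/split step above can be sketched in a few lines; hashing the serialized conversation is an assumed dedup key, and the 10% validation split is an illustrative default:

```python
import hashlib
import json
import random

def dedupe_and_split(conversations, val_frac=0.1, seed=0):
    """Combine, deduplicate (by content hash), and split into train/val."""
    seen, unique = set(), []
    for conv in conversations:
        key = hashlib.sha256(
            json.dumps(conv, sort_keys=True).encode()
        ).hexdigest()
        if key not in seen:  # drop exact duplicates across sources
            seen.add(key)
            unique.append(conv)
    random.Random(seed).shuffle(unique)  # deterministic, reproducible split
    n_val = max(1, int(len(unique) * val_frac))
    return unique[n_val:], unique[:n_val]
```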
Six layers of undocumented hardware and APIs that we discovered, benchmarked, and proved to work — all running on your Mac right now.
Undocumented CPU matrix coprocessor. 77x faster than NEON, 2.5 TFLOPS FP32. Every transformer matmul uses this.
Discovered _ANEClient with 46 methods, _ANEModel (52), _ANEInMemoryModel (41). 16-core dedicated neural accelerator.
Shared memory across CPU, GPU, and ANE with no memcpy. 5+ TB/s effective bandwidth. The secret to hybrid pipelines.
MTLFunctionConstant specialization compiles purpose-built GPU shaders for each model config. Full loop unrolling.
GPU prefill + decode with IOSurface KV cache. 1,333 tok/s demonstrated — 53x real-time margin.
Every layer built, tested, and proven on M4 Max. Full benchmark suite with automated report generation.
Install, extract, fine-tune, and serve. That's it.
Unified memory means zero-copy GPU access to model weights. More memory = bigger models.
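"More memory = bigger models" is simple arithmetic: weight memory is roughly parameters times bits per weight, divided by 8. A sketch (the ~10% overhead factor for KV cache and runtime buffers is an assumption):

```python
def model_memory_gb(params_b, bits, overhead=1.1):
    """Rough weight-memory estimate for a model with `params_b` billion
    parameters at `bits` bits per weight, plus ~10% runtime overhead."""
    return params_b * 1e9 * bits / 8 * overhead / 2**30

# e.g. an 8B model: ~4 GB at 4-bit, ~16 GB at fp16.
```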