The full vertical stack for on-device AI
Read our research
Talk to us

Below 1,000 tokens per second, you have a chatbot. Above it, you have a new computing primitive.
Below the threshold, the user waits on a cursor. Above it, the model can think, call tools, and draw UI faster than a human can blink.
Mirai owns every layer required to cross that line.
Mirai starts where cloud labs won't.
Most labs treat on-device as a compression problem: take the cloud model, make it smaller, ship it. Mirai starts from the hardware constraint up.
The device already has the silicon.
The missing piece is the stack.

1,000 tokens per second on-device.
That is the number where the assistant stops returning text and starts being the interface. At 1,000 t/s it renders interfaces, fills forms, executes multi-step workflows, and resolves integrations in real time, locally and privately.
Mirai owns the full stack to reach it.
Model Intelligence
Block diffusion
Speculative routing
Per-layer n-gram embeddings
Built for memory-bound decoding
Inference Runtime
MPSGraph kernels
W8A8 + vector-quantised weights
ASTC zero-overhead loading
Metal-native execution, no CoreML overhead
Hardware Optimization
On-device execution of AI actions
Local context, memory, and state
Deterministic, low-latency decision loops
Works offline, syncs with cloud when needed
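Why "built for memory-bound decoding" matters can be made concrete with a back-of-the-envelope roofline: in autoregressive decoding, every generated token streams the full weight set from DRAM once, so tokens per second are capped by bandwidth divided by compressed model size. A minimal sketch, using illustrative figures (a 1B-parameter model on a part with ~100 GB/s of unified-memory bandwidth; these are assumptions, not measured Mirai numbers):

```python
def decode_ceiling_tps(params_b: float, bits_per_weight: float,
                       bandwidth_gbps: float) -> float:
    """Upper bound on tokens/s for memory-bound autoregressive decoding:
    each decoded token reads every weight from DRAM once."""
    bytes_per_token = params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbps * 1e9 / bytes_per_token

# Illustrative: 1B parameters, 100 GB/s bandwidth.
fp16 = decode_ceiling_tps(1.0, 16, 100)  # ~50 t/s ceiling
int8 = decode_ceiling_tps(1.0, 8, 100)   # ~100 t/s ceiling
vq3 = decode_ceiling_tps(1.0, 3, 100)    # ~267 t/s ceiling
print(round(fp16), round(int8), round(vq3))  # 50 100 267
```

Even aggressive compression leaves one-token-at-a-time decoding well short of 1,000 t/s under these assumptions, which is why the stack pairs quantization with techniques that amortize each weight read over many tokens.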
You cannot achieve 1,000 t/s through someone else's inference engine designed for different constraints, running someone else's model optimized for cloud arithmetic intensity, on hardware whose accelerators you can't fully address. Every layer must be co-designed. That is what we are doing.
The 3 bets Mirai is making.
Inference Engine
Rust-native
MPSGraph for ANE access without CoreML overhead
Targets W8A8 regime where Apple's neural accelerators peak
W8A8: peak ANE regime
2–4 bit: VQ-compressed weights
Block diffusion & speculative routing
Arithmetic intensity target
I = FLOPs / bytes_DRAM
↑ block size → ↑ I during diffusion
↑ VQ ratio → ↓ bytes_DRAM
W8A8 + vector quantization
W8A8 → int8 storage → 2–4 bit vector quantisation
SpinQuant-style rotation makes W8A8 nearly lossless
ASTC codec investigation for zero-overhead dequant
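The 2–4 bit regime follows from the arithmetic of vector quantization: storing a log2(K)-bit codebook index per group of d weights gives log2(K)/d effective bits per weight. A toy nearest-centroid quantiser makes the mechanics visible (the codebook here is hand-picked for illustration; a real pipeline would fit it to the weight distribution):

```python
import math

def vq_quantize(weight_groups, codebook):
    """Assign each d-dim weight group to its nearest codebook entry
    (toy vector quantiser; real systems train the codebook)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: dist2(w, codebook[k]))
            for w in weight_groups]

def effective_bits_per_weight(codebook_size, group_dim):
    """Index bits amortised over the group: log2(K) / d."""
    return math.log2(codebook_size) / group_dim

# A 256-entry codebook over 4-dim groups stores 8 bits per 4 weights,
# i.e. 2 bits/weight -- inside the 2-4 bit regime targeted above.
bits = effective_bits_per_weight(256, 4)
codebook = [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0)]
idx = vq_quantize([(0.1, 0.0, 0.2, 0.1), (0.9, 1.1, 1.0, 0.8)], codebook)
print(bits, idx)  # 2.0 [0, 1]
```

Every halving of bits per weight halves bytes_DRAM and so doubles the memory-bound throughput ceiling, which is the ↓ bytes_DRAM arrow in the target above.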
What Apple Silicon delivers today with Mirai.
Mirai-1B
Block size aligned to M-series ANE width. No full retraining.
Block-Diffusion
Mirai-3B-MoE
Block-sparse experts with speculative routing. Prefetch overlap.
Sparse
Mirai-Embed
Per-layer Engram-style embeddings. Richer context.
N-Gram
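The block-diffusion bet in these cards has a simple traffic argument behind it: when a block of B tokens is denoised together, one pass of weight reads is shared by all B tokens instead of one. A toy accounting sketch (the byte figure is illustrative, and real block diffusion needs several refinement passes per block, which the passes_per_block knob would capture):

```python
def weight_bytes_per_token(model_bytes, block_size, passes_per_block=1):
    """DRAM weight traffic per generated token when a block of
    `block_size` tokens shares each pass's weight reads."""
    return model_bytes * passes_per_block / block_size

model_bytes = 0.5e9  # e.g. ~1B parameters at ~4 bits/weight (illustrative)
ar = weight_bytes_per_token(model_bytes, block_size=1)    # autoregressive
bd = weight_bytes_per_token(model_bytes, block_size=16)   # block of 16
print(ar / bd)  # 16.0: per-token weight traffic drops 16x
```

This is the "↑ block size → ↑ I" lever: more FLOPs per byte of weights moved, which is exactly what memory-bound silicon rewards.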
We support the most popular architectures, optimized for peak performance.
We publish on the hardest problems in on-device inference.
Recent research articles:
Speculative routing for Block-MoE inference.
Predict expert activation from prior-block states.
W8A8+VQ hybrid: near-lossless 2–4 bit compression.
SpinQuant-style rotation + vector quantiser for GEMM kernels.
ASTC codec for zero-copy weight loading.
Hardware texture decompression repurposed for neural weights
Block diffusion on Apple Neural Engine.
Aligning block size to ANE width for max throughput.
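The speculative-routing idea above (predicting expert activation from prior-block states) can be sketched with a deliberately naive predictor: assume the next block reuses the previous block's experts, prefetch those, and measure the hit rate. This stand-in heuristic is hypothetical, not the learned predictor the research describes:

```python
def speculative_prefetch(prev_experts, actual_experts):
    """Toy speculative router: prefetch the previous block's experts
    and report what fraction of the next block's experts were already
    resident. (Hypothetical stand-in for a learned predictor.)"""
    prefetched = set(prev_experts)
    hits = sum(1 for e in actual_experts if e in prefetched)
    return hits / len(actual_experts)

# Expert ids chosen per block by a router (illustrative trace).
blocks = [[0, 3], [0, 3], [3, 5], [3, 5], [1, 5]]
rates = [speculative_prefetch(blocks[i - 1], blocks[i])
         for i in range(1, len(blocks))]
print(rates)  # [1.0, 0.5, 1.0, 0.5]
```

Every hit is an expert whose weights can stream from memory during the previous block's compute, overlapping prefetch with execution instead of stalling on it.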
Active research areas:
Block diffusion
Self-speculation
Speculative routing
Block-MoE
SpinQuant + VQ
ASTC kernels
Layer repetition
What Mirai ships today, builds next, and is aiming for.
Inference runtime.
Convert, optimize, distribute, and run any open model on Apple Silicon. The fastest engine on the platform. SDK published for iOS and macOS.
Mirai's own models — not compressed cloud.
1,000 t/s as the new computing primitive.
Why on-device AI needs its own lab.
Full-stack sovereignty.
No architectural choices imposed by third parties. The model is tailored to the hardware.
1.5B Apple Silicon users.
The largest compute substrate in history, underutilised for AI inference.
Privacy by architecture.
On-device inference means data never leaves the user's device.
Want to work on unsolved problems in on-device AI?
Open roles:
Machine Learning Engineer
Remote / SF / Europe • Model Optimization
Machine Learning Engineer
Remote / SF / Europe • Models & Research
Inference Engineer
Remote / SF / Europe
Intelligence for the edge.