The full vertical stack for on-device AI

Mirai labs builds the inference runtime, trains the models, and optimizes for the hardware ...

Read our research

Talk to us

Schedule team sync • Powered by Mirai • Offline
Book a meeting with my team for tomorrow at 3pm and notify them on Slack.
Thought for 1.4s — planning tool sequence
Called calendar.create(date="tomorrow", time="15:00", title="Team sync")
Called calendar.invite(event_id="…", recipients="team")
Called slack.send(channel="#team", message="Meeting confirmed for tomorrow 3pm")
Executing on-device...

3 tool calls · 1,238 t/s · Qwen3-0.6B · W8A8 · M4 Pro · ttft 71ms
Why on-device?

On-device AI is inevitable.

Latency.

A network round-trip has a floor you cannot engineer away.

Privacy.

Cloud AI privacy is a policy. On-device privacy is architecture.

Cost at scale.

Per-token cost is invisible at low volume. At millions of daily users it becomes the dominant cost line.

Hardware.

Modern devices ship with 38 TOPS of neural compute. That silicon exists, is paid for, and sits idle.

The on-device leg (memory-bound, hardware-native, built from scratch) is the unsolved problem. Most labs have never started there. Mirai starts there.

The shift.

Mirai starts where cloud labs won't.

Most labs treat on-device as a compression problem: take the cloud model, make it smaller, ship it. Mirai starts from the hardware constraint up.

The device already has the silicon.
The missing piece is the stack.

Where most labs end up: Cloud model → Compress & quantize → Optimize → Deploy on device

Where Mirai starts: Hardware constraint → Memory budget → Architecture → Model

Our target:

1,000 tokens per second on-device.

That is the number where the assistant stops returning text and starts being the interface.

At 1,000 t/s, the assistant doesn't return text. It renders interfaces, fills forms, executes multi-step workflows, and resolves integrations in real time, locally, privately. Below that threshold you have a chatbot. Above it you have a new computing primitive.

Full-stack sovereignty

Mirai owns the full stack to reach it.

Model Intelligence

Block diffusion

Speculative routing

Per-layer n-gram embeddings

Built for memory-bound decoding

Inference Runtime

MPSGraph kernels

W8A8 + vector-quantised weights

ASTC zero-overhead loading

Metal-native execution, no CoreML overhead

Hardware Optimization

On-device execution of AI actions

Local context, memory, and state

Deterministic, low-latency decision loops

Works offline, syncs with cloud when needed

You cannot achieve 1,000 t/s through someone else's inference engine designed for different constraints, running someone else's model optimized for cloud arithmetic intensity, on hardware whose accelerators you can't fully address. Every layer must be co-designed. That is what we are doing.

Architecture

Full control of the inference stack is what gives us a unique advantage in on-device AI: we have the freedom to tailor the model to the hardware.

Inference Engine.

Portable runtimes (llama.cpp, MLX) average across hardware. They cannot address the Neural Engine without the CoreML compiler in the way. The W8A8 regime, where Apple accelerators peak, is left unused.

Rust-native engine on MPSGraph

Direct ANE dispatch

W8A8 + int8 storage + 2–4 bit vector quantization

ASTC codec repurposed for zero-overhead weight dequant at load time
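As a rough illustration of what the W8A8 regime means, the sketch below quantizes both weights and activations to int8 with a symmetric per-tensor scale, accumulates in integers, and rescales once at the end. It is a minimal Python sketch and not Mirai's kernel: the function names, the per-tensor scaling choice, and the sample values are all assumptions made for illustration.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: returns (integer codes, scale)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Clamp to [-127, 127] so the code range stays symmetric around zero.
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def w8a8_dot(weights, activations):
    """Dot product with int8 weights and int8 activations (hypothetical helper)."""
    w_q, w_scale = quantize_int8(weights)
    a_q, a_scale = quantize_int8(activations)
    acc = sum(w * a for w, a in zip(w_q, a_q))  # integer accumulation
    return acc * w_scale * a_scale              # single rescale back to float


# Illustrative values only, not taken from any real model.
weights = [0.12, -0.5, 0.33, 0.9]
activations = [1.0, 0.25, -0.75, 0.5]

exact = sum(w * a for w, a in zip(weights, activations))
approx = w8a8_dot(weights, activations)
print(f"float: {exact:.4f}  w8a8: {approx:.4f}")
```

Real kernels typically use per-channel or per-group scales and fuse the rescale into the matmul; the per-tensor version here only shows the arithmetic.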

Block diffusion & speculative routing.

Autoregressive decoding is memory-bandwidth starved on-device. Standard MoE routes per-token, doubling memory pressure. Every decoding step wastes bandwidth.

Block diffusion with block size aligned to ANE width; no full retraining.

Speculative routing predicts expert activation from prior-block states, enabling disk offload with prefetch overlap.

Per-layer n-gram embeddings reduce vocabulary footprint.
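The memory-bandwidth argument above can be made concrete with a back-of-the-envelope ceiling: if every decoding pass has to stream all weights from memory once, throughput is bounded by bandwidth divided by bytes per pass, multiplied by the number of tokens each pass produces. The Python sketch below encodes that estimate. The function name and every number in the example (parameter count, bytes per weight, bandwidth, block size) are illustrative assumptions, not Mirai measurements, and the model ignores KV-cache traffic, activations, and compute limits.

```python
def decode_ceiling_tps(params, bytes_per_weight, bandwidth_bytes_per_s,
                       tokens_per_pass=1):
    """Upper bound on tokens/s when each pass streams all weights once."""
    bytes_per_pass = params * bytes_per_weight
    passes_per_s = bandwidth_bytes_per_s / bytes_per_pass
    return passes_per_s * tokens_per_pass


# Hypothetical device and model: 0.6e9 weights at 1 byte each, 200 GB/s.
one_token = decode_ceiling_tps(0.6e9, 1.0, 200e9, tokens_per_pass=1)
four_tokens = decode_ceiling_tps(0.6e9, 1.0, 200e9, tokens_per_pass=4)

print(f"token-by-token ceiling: {one_token:.0f} t/s")
print(f"4 tokens per pass ceiling: {four_tokens:.0f} t/s")
```

Under these assumed numbers, emitting one token per weight pass cannot reach 1,000 t/s no matter how fast the kernels are, while producing a block of tokens per pass raises the ceiling proportionally. Shrinking bytes per weight moves the same ceiling from the other side.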

W8A8 + vector quantization.

Naive int4 quantization destroys quality. Cloud-style quant-aware training is impractical for open weights. Calibration-only methods drift on long contexts.

W8A8 with SpinQuant-style rotations: nearly lossless at int8.

2–4 bit vector quantization for weight storage.

Hardware texture decompression (ASTC) repurposed for zero-copy dequant at load time.
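To show how vector quantization reaches a few bits per weight, the sketch below groups weights into short vectors, learns a small codebook with plain k-means, and stores only the codebook index for each group. It is a toy Python illustration under stated assumptions: the random Gaussian weights, the group size of 2, the 16-entry codebook, and all helper names are hypothetical, and it does not include the SpinQuant-style rotations or the ASTC loading path described above.

```python
import math
import random


def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))


def train_codebook(vectors, k, iters=10, seed=0):
    """Plain k-means (Lloyd's algorithm); returns k codewords."""
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            buckets[nearest(codebook, v)].append(v)
        for i, bucket in enumerate(buckets):
            if bucket:  # keep the old codeword if nothing was assigned to it
                codebook[i] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return codebook


def vq_encode(weights, codebook, dim):
    """Replace each group of `dim` weights by the index of its nearest codeword."""
    groups = [weights[i:i + dim] for i in range(0, len(weights), dim)]
    return [nearest(codebook, g) for g in groups]


def vq_decode(indices, codebook):
    """Look each index up in the codebook and flatten back to a weight list."""
    return [x for i in indices for x in codebook[i]]


# Synthetic stand-in for a weight tensor (not real model weights).
rng = random.Random(1)
weights = [rng.gauss(0.0, 1.0) for _ in range(512)]

dim, k = 2, 16
groups = [weights[i:i + dim] for i in range(0, len(weights), dim)]
codebook = train_codebook(groups, k)
indices = vq_encode(weights, codebook, dim)
restored = vq_decode(indices, codebook)

# A 4-bit index shared by 2 weights; codebook storage is not counted here.
bits_per_weight = math.log2(k) / dim
mse = sum((a - b) ** 2 for a, b in zip(weights, restored)) / len(weights)
print(f"{bits_per_weight:.1f} bits/weight, reconstruction MSE {mse:.3f}")
```

Larger groups or larger codebooks trade index bits against reconstruction error; the codebook itself adds a fixed storage cost that is amortized over the tensor.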

Numbers:

What Apple Silicon delivers today with Mirai.

We publish on the hardest problems in on-device inference.

Recent research articles:

Speculative routing for Block-MoE inference.

Predict expert activation from prior-block states.

W8A8+VQ hybrid: near-lossless 2–4 bit compression.

SpinQuant-style rotation + vector quantiser for GEMM kernels.

ASTC codec for zero-copy weight loading.

Hardware texture decompression repurposed for neural weights.

Block diffusion on Apple Neural Engine.

Aligning block size to ANE width for max throughput.

Active research areas:

Block diffusion

Self-speculation

Speculative routing

Block-MoE

SpinQuant + VQ

ASTC kernels

Layer repetition

Roadmap

What Mirai ships today, builds next, and is aiming for.

Now: Inference runtime

Next 3–5 months: Mirai's own models

Vision: 1,000 t/s

Want to work on unsolved problems in on-device AI?

Open roles:

Machine Learning Engineer · Models Optimization · Remote / SF / Europe

Machine Learning Engineer · Models & Research · Remote / SF / Europe

Inference Engineer · Remote / SF / Europe

On-device AI research lab is building ...

Intelligence for the edge.
