The full vertical stack for on-device AI
Read our research
Talk to us

Below 1,000 tokens per second, you have a chatbot. Above it, you have a new computing primitive.
Below the threshold, the user waits on a cursor. Above it, the model can think, call tools, and draw UI faster than a human can blink.
Mirai owns every layer required to cross that line.
Mirai starts where cloud labs won't.
Most labs treat on-device as a compression problem: take the cloud model, make it smaller, ship it. Mirai starts from the hardware constraint up.
The device already has the silicon.
The missing piece is the stack.

1,000 tokens per second on-device.
That is the number where the assistant stops returning text and starts being the interface. At 1,000 t/s it renders interfaces, fills forms, executes multi-step workflows, and resolves integrations in real time, locally and privately.
Mirai owns the full stack to reach it.
Model Intelligence
Block diffusion
Speculative routing
Per-layer n-gram embeddings
Built for memory-bound decoding
Inference Runtime
MPSGraph kernels
W8A8 + vector-quantised weights
ASTC zero-overhead loading
Metal-native execution, no CoreML overhead
Hardware Optimization
On-device execution of AI actions
Local context, memory, and state
Deterministic, low-latency decision loops
Works offline, syncs with cloud when needed
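Why "built for memory-bound decoding" matters can be made concrete with a back-of-the-envelope roofline: in autoregressive decoding, every generated token streams the full weight set from DRAM once, so tokens per second are capped by bandwidth divided by compressed model size. A minimal sketch, using illustrative figures (a 1B-parameter model on a part with ~100 GB/s of unified-memory bandwidth; these are assumptions, not measured Mirai numbers):

```python
def decode_ceiling_tps(params_b: float, bits_per_weight: float,
                       bandwidth_gbps: float) -> float:
    """Upper bound on tokens/s for memory-bound autoregressive decoding:
    each decoded token reads every weight from DRAM once."""
    bytes_per_token = params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gbps * 1e9 / bytes_per_token

# Illustrative: 1B parameters, 100 GB/s bandwidth.
fp16 = decode_ceiling_tps(1.0, 16, 100)  # ~50 t/s ceiling
int8 = decode_ceiling_tps(1.0, 8, 100)   # ~100 t/s ceiling
vq3 = decode_ceiling_tps(1.0, 3, 100)    # ~267 t/s ceiling
print(round(fp16), round(int8), round(vq3))  # 50 100 267
```

Even aggressive compression leaves one-token-at-a-time decoding well short of 1,000 t/s under these assumptions, which is why the stack pairs quantization with techniques that amortize each weight read over many tokens.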
You cannot achieve 1,000 t/s through someone else's inference engine designed for different constraints, running someone else's model optimized for cloud arithmetic intensity, on hardware whose accelerators you can't fully address. Every layer must be co-designed. That is what we are doing.
The 3 bets Mirai is making.
Inference Engine
Rust-native
MPSGraph for ANE access without CoreML overhead
Targets W8A8 regime where Apple's neural accelerators peak
W8A8: peak ANE regime
2–4 bit: VQ-compressed weights
Block diffusion & speculative routing
Arithmetic intensity target
I = FLOPs / bytes_DRAM
↑ block size → ↑ I during diffusion
↑ VQ ratio → ↓ bytes_DRAM
W8A8 + vector quantization
W8A8 → int8 storage → 2–4 bit vector quantisation
SpinQuant-style rotation makes W8A8 nearly lossless
ASTC codec investigation for zero-overhead dequant
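The 2–4 bit regime follows from the arithmetic of vector quantization: storing a log2(K)-bit codebook index per group of d weights gives log2(K)/d effective bits per weight. A toy nearest-centroid quantiser makes the mechanics visible (the codebook here is hand-picked for illustration; a real pipeline would fit it to the weight distribution):

```python
import math

def vq_quantize(weight_groups, codebook):
    """Assign each d-dim weight group to its nearest codebook entry
    (toy vector quantiser; real systems train the codebook)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: dist2(w, codebook[k]))
            for w in weight_groups]

def effective_bits_per_weight(codebook_size, group_dim):
    """Index bits amortised over the group: log2(K) / d."""
    return math.log2(codebook_size) / group_dim

# A 256-entry codebook over 4-dim groups stores 8 bits per 4 weights,
# i.e. 2 bits/weight -- inside the 2-4 bit regime targeted above.
bits = effective_bits_per_weight(256, 4)
codebook = [(0.0, 0.0, 0.0, 0.0), (1.0, 1.0, 1.0, 1.0)]
idx = vq_quantize([(0.1, 0.0, 0.2, 0.1), (0.9, 1.1, 1.0, 0.8)], codebook)
print(bits, idx)  # 2.0 [0, 1]
```

Every halving of bits per weight halves bytes_DRAM and so doubles the memory-bound throughput ceiling, which is the ↓ bytes_DRAM arrow in the target above.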
What Apple Silicon delivers today with Mirai.
Mirai-1B
Block size aligned to M-series ANE width. No full retraining.
Block-Diffusion
Mirai-3B-MoE
Block-sparse experts with speculative routing. Prefetch overlap.
Sparse
Mirai-Embed
Per-layer Engram-style embeddings. Richer context.
N-Gram
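The block-diffusion bet in these cards has a simple traffic argument behind it: when a block of B tokens is denoised together, one pass of weight reads is shared by all B tokens instead of one. A toy accounting sketch (the byte figure is illustrative, and real block diffusion needs several refinement passes per block, which the passes_per_block knob would capture):

```python
def weight_bytes_per_token(model_bytes, block_size, passes_per_block=1):
    """DRAM weight traffic per generated token when a block of
    `block_size` tokens shares each pass's weight reads."""
    return model_bytes * passes_per_block / block_size

model_bytes = 0.5e9  # e.g. ~1B parameters at ~4 bits/weight (illustrative)
ar = weight_bytes_per_token(model_bytes, block_size=1)    # autoregressive
bd = weight_bytes_per_token(model_bytes, block_size=16)   # block of 16
print(ar / bd)  # 16.0: per-token weight traffic drops 16x
```

This is the "↑ block size → ↑ I" lever: more FLOPs per byte of weights moved, which is exactly what memory-bound silicon rewards.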
We support the most popular architectures, optimized for peak performance.
We publish on the hardest problems in on-device inference.
Recent research articles:
Speculative routing for Block-MoE inference.
Predict expert activation from prior-block states.
W8A8+VQ hybrid: near-lossless 2–4 bit compression.
SpinQuant-style rotation + vector quantiser for GEMM kernels.
ASTC codec for zero-copy weight loading.
Hardware texture decompression repurposed for neural weights
Block diffusion on Apple Neural Engine.
Aligning block size to ANE width for max throughput.
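The speculative-routing idea above (predicting expert activation from prior-block states) can be sketched with a deliberately naive predictor: assume the next block reuses the previous block's experts, prefetch those, and measure the hit rate. This stand-in heuristic is hypothetical, not the learned predictor the research describes:

```python
def speculative_prefetch(prev_experts, actual_experts):
    """Toy speculative router: prefetch the previous block's experts
    and report what fraction of the next block's experts were already
    resident. (Hypothetical stand-in for a learned predictor.)"""
    prefetched = set(prev_experts)
    hits = sum(1 for e in actual_experts if e in prefetched)
    return hits / len(actual_experts)

# Expert ids chosen per block by a router (illustrative trace).
blocks = [[0, 3], [0, 3], [3, 5], [3, 5], [1, 5]]
rates = [speculative_prefetch(blocks[i - 1], blocks[i])
         for i in range(1, len(blocks))]
print(rates)  # [1.0, 0.5, 1.0, 0.5]
```

Every hit is an expert whose weights can stream from memory during the previous block's compute, overlapping prefetch with execution instead of stalling on it.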
Active research areas:
Block diffusion
Self-speculation
Speculative routing
Block-MoE
SpinQuant + VQ
ASTC kernels
Layer repetition
What Mirai ships today, builds next, and is aiming for.
Inference runtime.
Convert, optimize, distribute, and run any open model on Apple Silicon. The fastest engine on the platform. SDK published for iOS and macOS.
Mirai's own models — not compressed cloud.
1,000 t/s as the new computing primitive.
Why on-device AI needs its own lab.
Full-stack sovereignty.
No architectural choices imposed by third parties. The model is tailored to the hardware.
1.5B Apple Silicon users.
The largest compute substrate in history, underutilised for AI inference.
Privacy by architecture.
On-device inference means data never leaves the user's device.
Want to work on unsolved problems in on-device AI?
Open roles:
Machine Learning Engineer
Remote / SF / Europe • Model Optimization
Machine Learning Engineer
Remote / SF / Europe • Models & Research
Inference Engineer
Remote / SF / Europe
Intelligence for the edge.