The assistant should already be the interface. We're building what's missing.
Instantaneous. Private. No network dependency. Mirai builds the models, runtime, and quantization stack that bring frontier AI capability to the hardware billions of people already own.
Read our research
Talk to us

Nobody wants to wait.
And right now, everyone does.
Today's AI agents
Thinking… please wait.
You ask. It starts working. Two minutes pass. You check. It's done something wrong. It says: sorry, let me try again. Five more minutes.
What should exist
Instant.
Every time.
You say what you need. The result appears before you finish the sentence. No round-trip. No spinner. No "let me check on that."
This is what on-device AI makes possible. If it's fast enough.
Say what you need.
Your device handles the rest
No app to open.
No form to navigate.
No network required.
The assistant resolves it.
This doesn't exist yet. The AI capable of doing it needs to be fast enough to be the UI itself. Not a tool you wait for.

Mirai owns the full stack to reach it

Model Intelligence
Inference Runtime
Full control of the inference stack gives us a unique advantage in on-device AI: the freedom to tailor the model to the hardware.
Inference Engine.
Portable runtimes (llama.cpp, MLX) average across hardware. They cannot reach the Neural Engine except through the CoreML compiler. The W8A8 regime, where Apple's accelerators peak, is left unused.
Rust-native engine on MPSGraph
Direct ANE dispatch
W8A8 + int8 storage + 2–4 bit vector quantization (see the sketch below)
ASTC codec repurposed for zero-overhead weight dequant at load time
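To make the W8A8 bullet concrete, here is a minimal std-only Rust sketch of the numeric scheme. It is an illustration, not Mirai's engine code, and the names (quantize_row, dot_w8a8) are ours: weights and activations are mapped to int8 with a symmetric per-row scale, the dot product accumulates in int32, and the two float scales are applied once at the end.

// Toy W8A8 sketch: symmetric per-row int8 quantization for weights
// and activations, integer accumulation, one float rescale per output.

/// Quantize one row of f32 values to int8 with a symmetric scale.
fn quantize_row(row: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = row.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q = row.iter()
        .map(|&x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// int8 x int8 dot product accumulated in i32, rescaled to f32 at the end.
fn dot_w8a8(w: &[i8], x: &[i8], w_scale: f32, x_scale: f32) -> f32 {
    let acc: i32 = w.iter().zip(x.iter()).map(|(&a, &b)| a as i32 * b as i32).sum();
    acc as f32 * w_scale * x_scale
}

fn main() {
    let weights = [0.12f32, -0.80, 0.33, 0.05];
    let activations = [1.5f32, -0.2, 0.7, 2.1];
    let (wq, ws) = quantize_row(&weights);
    let (xq, xs) = quantize_row(&activations);
    let exact: f32 = weights.iter().zip(activations.iter()).map(|(a, b)| a * b).sum();
    println!("exact = {exact:.4}, w8a8 = {:.4}", dot_w8a8(&wq, &xq, ws, xs));
}

On these toy inputs the int8 result closely tracks the exact dot product: every multiply-accumulate runs in the integer domain the accelerators are built for, with floating point touched only once per output.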
Block diffusion & speculative routing.
Autoregressive decoding is starved for memory bandwidth on device. Standard MoE routes per token, doubling memory pressure. Every decoding step wastes bandwidth.
Block diffusion with block size aligned to ANE width — no full retraining.
Speculative routing predicts expert activation from prior-block states, enabling disk offload with prefetch overlap (sketched below).
Per-layer n-gram embeddings reduce vocabulary footprint.
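A minimal sketch of the speculative-routing item above, in std-only Rust. The names (predict_experts, load_expert) are hypothetical, and a toy scorer stands in for the learned predictor: experts predicted from the prior block's state are fetched on a background thread while the current block computes, so disk I/O overlaps compute instead of stalling it.

use std::collections::HashMap;
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

type ExpertId = usize;

/// Toy stand-in for a learned router: score each expert against the
/// prior block's hidden state and keep the top k.
fn predict_experts(prior_state: &[f32], n_experts: usize, k: usize) -> Vec<ExpertId> {
    let mut scores: Vec<(ExpertId, f32)> = (0..n_experts)
        .map(|e| {
            let s = prior_state.iter().enumerate()
                .map(|(i, &x)| x * ((e * 31 + i) % 7) as f32)
                .sum();
            (e, s)
        })
        .collect();
    scores.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scores.truncate(k);
    scores.into_iter().map(|(e, _)| e).collect()
}

/// Stand-in for reading one expert's weights from disk.
fn load_expert(id: ExpertId) -> Vec<f32> {
    thread::sleep(Duration::from_millis(5)); // pretend disk latency
    vec![id as f32; 16]
}

fn main() {
    let prior_state = vec![0.3f32, -1.2, 0.8, 0.1];
    let predicted = predict_experts(&prior_state, 8, 2);

    // Prefetch predicted experts on a background thread...
    let (tx, rx) = mpsc::channel();
    let ids = predicted.clone();
    let prefetch = thread::spawn(move || {
        for id in ids {
            tx.send((id, load_expert(id))).unwrap();
        }
    });

    // ...while the current block's compute would run here, overlapping the I/O.

    let cache: HashMap<ExpertId, Vec<f32>> = rx.into_iter().take(predicted.len()).collect();
    prefetch.join().unwrap();
    println!("prefetched experts: {:?}", cache.keys().collect::<Vec<_>>());
}

A real runtime also needs a fallback path for mispredicted experts; the sketch shows only the overlap structure that hides disk latency behind compute.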
W8A8 + vector quantization.
Naive int4 quantization destroys quality. Cloud-style quant-aware training is impractical for open weights. Calibration-only methods drift on long contexts.
W8A8 with SpinQuant-style rotations — nearly lossless at int8 (sketched below).
2–4 bit vector quantization for weight storage.
Hardware texture decompression (ASTC) repurposed for zero-copy dequant at load time.
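To show why the rotations in the first bullet help, here is a std-only Rust sketch using a normalized Walsh-Hadamard transform as a stand-in for SpinQuant-style rotations (the function names are ours): an orthogonal rotation preserves dot products exactly, but spreads a single outlier across all coordinates, so the int8 scale is no longer dominated by one value.

// Toy rotation-before-quantization sketch. H/sqrt(n) is orthogonal and
// its own inverse, so rotate -> quantize -> rotate back is a fair test.

/// In-place fast Walsh-Hadamard transform, normalized to be orthogonal.
fn hadamard(v: &mut [f32]) {
    let n = v.len();
    assert!(n.is_power_of_two());
    let mut h = 1;
    while h < n {
        for i in (0..n).step_by(h * 2) {
            for j in i..i + h {
                let (a, b) = (v[j], v[j + h]);
                v[j] = a + b;
                v[j + h] = a - b;
            }
        }
        h *= 2;
    }
    let norm = (n as f32).sqrt();
    for x in v.iter_mut() { *x /= norm; }
}

/// Symmetric int8 round trip: quantize with one scale, dequantize back.
fn quant_dequant_int8(v: &[f32]) -> Vec<f32> {
    let max_abs = v.iter().fold(0.0f32, |m, &x| m.max(x.abs()));
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    v.iter().map(|&x| (x / scale).round().clamp(-127.0, 127.0) * scale).collect()
}

fn rms_err(a: &[f32], b: &[f32]) -> f32 {
    (a.iter().zip(b.iter()).map(|(x, y)| (x - y).powi(2)).sum::<f32>() / a.len() as f32).sqrt()
}

fn main() {
    // One large outlier dominates the scale and crushes the small values.
    let w = vec![8.0f32, 0.02, -0.03, 0.01, 0.02, -0.01, 0.03, 0.01];

    let direct = quant_dequant_int8(&w);

    let mut rotated = w.clone();
    hadamard(&mut rotated);                  // rotate
    let mut back = quant_dequant_int8(&rotated);
    hadamard(&mut back);                     // inverse (H is its own inverse)

    println!("int8 error, plain:   {:.5}", rms_err(&w, &direct));
    println!("int8 error, rotated: {:.5}", rms_err(&w, &back));
}

Running it prints a markedly lower RMS error for the rotated path on this outlier-heavy row; SpinQuant itself optimizes the rotation rather than fixing it to Hadamard.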
What Apple Silicon delivers today with Mirai.
We support the most popular model architectures, optimized for peak performance.
We publish on the hardest problems in on-device inference.
Recent research articles:
Speculative routing for Block-MoE inference.
Predict expert activation from prior-block states.
W8A8+VQ hybrid: near-lossless 2–4 bit compression.
SpinQuant-style rotation + vector quantizer for GEMM kernels.
ASTC codec for zero-copy weight loading.
Hardware texture decompression repurposed for neural weights (see the arithmetic sketch after this list).
Block diffusion on Apple Neural Engine.
Aligning block size to ANE width for max throughput.
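A note on the ASTC entry above, with arithmetic that follows from the format itself (the numbers below are ours, not from the article): ASTC encodes a fixed 128-bit block for any texel footprint from 4x4 up to 12x12, so storing one weight per texel costs 128 / (block width x height) bits per weight, and the GPU's texture units decompress on access rather than in a separate CPU pass.

// Toy footprint arithmetic for ASTC-packed weights (illustrative only).

/// ASTC always spends 128 bits per block, whatever the block footprint.
fn astc_bits_per_weight(block_w: u32, block_h: u32) -> f64 {
    128.0 / (block_w * block_h) as f64
}

fn main() {
    // A 4096 x 4096 weight matrix, one weight stored per texel.
    let params: u64 = 4096 * 4096;
    let fp16_mib = params as f64 * 2.0 / (1024.0 * 1024.0);
    println!("fp16 baseline: {fp16_mib:.1} MiB");
    for (w, h) in [(4u32, 4u32), (6, 6), (8, 8), (12, 12)] {
        let bpw = astc_bits_per_weight(w, h);
        let mib = params as f64 * bpw / 8.0 / (1024.0 * 1024.0);
        println!("ASTC {w}x{h}: {bpw:.2} bits/weight -> {mib:.1} MiB");
    }
}

The 8x8 footprint lands at exactly 2 bits per weight, inside the 2–4 bit storage range quoted elsewhere on this page.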
Active research areas:
Block diffusion
Self-speculation
Speculative routing
Block-MoE
SpinQuant + VQ
ASTC kernels
Layer repetition
What Mirai ships today, builds next, and is aiming for.
Now: inference runtime
Next 3–5 months: Mirai's own models
Vision: 1,000 t/s
Want to work on unsolved problems in on-device AI?
Open roles:
Machine Learning Engineer
Remote / SF / Europe • Model Optimization
Machine Learning Engineer
Remote / SF / Europe • Models & Research
Inference Engineer
Remote / SF / Europe
Intelligence for the edge.
Read our research
Talk to us