Intelligence that lives on the device
The model, the inference stack, and the hardware abstractions. Full-stack sovereignty over on-device AI for Apple Silicon.
Read our research
View our models
Talk to us
We own the full stack of on-device AI
Model Intelligence.
Block diffusion
Speculative routing
Per-layer n-gram embeddings
Architectures built for memory-bound decoding
Inference Engine.
MPSGraph kernels
W8A8 + vector-quantised weights
ASTC zero-overhead loading
Metal-native execution, no CoreML overhead
Agentic Infrastructure.
On-device execution of AI actions
Local context, memory, and state
Deterministic, low-latency decision loops
Works offline, syncs with cloud when needed
We publish on the hardest problems in on-device inference
Preprint
Speculative routing for Block-MoE inference.
Predict expert activation from prior-block states (sketched after this list).
Preprint
W8A8+VQ hybrid: near-lossless 2–4 bit compression.
SpinQuant-style rotation + vector quantiser for GEMM kernels.
Blog post
ASTC codec for zero-copy weight loading.
Hardware texture decompression repurposed for neural weights.
Blog post
Block diffusion on Apple Neural Engine.
Aligning block size to ANE width for max throughput.
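To make the routing idea concrete, here is a minimal sketch assuming a learned linear probe per expert. Every name and shape below is illustrative, not our shipped implementation.

```rust
// Illustrative only: `ExpertId`, `predict_experts`, and the probe shapes are
// hypothetical names, not the Mirai API.
type ExpertId = usize;

/// Score every expert against the pooled hidden state of block t-1
/// (dot product with a learned probe) and keep the top-k as prefetch hints.
fn predict_experts(prev_block_state: &[f32], probes: &[Vec<f32>], top_k: usize) -> Vec<ExpertId> {
    let mut scored: Vec<(ExpertId, f32)> = probes
        .iter()
        .enumerate()
        .map(|(id, w)| {
            let score: f32 = w.iter().zip(prev_block_state).map(|(a, b)| a * b).sum();
            (id, score)
        })
        .collect();
    // Highest score first.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(top_k).map(|(id, _)| id).collect()
}

fn main() {
    let prev_state = vec![0.5, -1.0, 0.25, 2.0]; // pooled state of block t-1
    let probes = vec![
        // One learned probe per expert (toy values).
        vec![1.0, 0.0, 0.0, 0.0],
        vec![0.0, 1.0, 0.0, 0.0],
        vec![0.0, 0.0, 0.0, 1.0],
        vec![0.5, 0.5, 0.5, 0.5],
    ];
    // Predict block t's experts while block t-1 is still decoding,
    // so their weights can be prefetched from disk ahead of the router.
    let hints = predict_experts(&prev_state, &probes, 2);
    println!("prefetch experts: {:?}", hints); // [2, 3]
}
```

The probe runs while block t−1 is still decoding, which is what lets disk offload overlap with compute.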
Active research areas
Block diffusion
Self-speculation
Speculative routing
Block-MoE
SpinQuant + VQ
ASTC kernels
Layer repetition
We build our own models. Trained from the ground up for on-device deployment
Block-Diffusion
Mirai-1B
Block size aligned to M-series ANE width. Converted from an autoregressive base, no full retraining.
Sparse
Mirai-3B-MoE
Block-sparse experts with speculative routing. Disk offload with prefetch overlap.
N-gram
Mirai-Embed
Per-layer Engram-style embeddings. Reduced vocabulary × richer context (sketched below).
Not scaled-down cloud models. Architectures designed around memory bandwidth and ANE throughput.
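As a rough illustration of the per-layer n-gram idea, a hashed bigram lookup can stand in for the Engram-style tables. The bucket count, hash, and n-gram order below are our assumptions for the sketch, not the Mirai-Embed design.

```rust
// Toy stand-in for Engram-style per-layer n-gram embeddings. The hashed
// bigram tables, bucket count, and widths below are assumptions.
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const BUCKETS: usize = 1 << 16; // buckets per layer table (assumed)
const D: usize = 4;             // toy embedding width

/// Hash an n-gram of token ids into a bucket of one layer's table.
fn bucket(ngram: &[u32], layer: usize) -> usize {
    let mut h = DefaultHasher::new();
    (layer, ngram).hash(&mut h);
    (h.finish() as usize) % BUCKETS
}

/// Look up the bigram ending at `pos` in the given layer's table,
/// adding context to that layer without a giant vocabulary matrix.
fn bigram_embedding(tokens: &[u32], pos: usize, layer: usize, table: &[[f32; D]]) -> [f32; D] {
    let mut out = [0.0f32; D];
    if pos >= 1 {
        let row = &table[bucket(&tokens[pos - 1..=pos], layer)];
        for (o, v) in out.iter_mut().zip(row) {
            *o += v;
        }
    }
    out
}

fn main() {
    // One small learned table per layer; constant here for brevity.
    let table = vec![[0.01f32; D]; BUCKETS];
    let tokens = [17u32, 42, 42, 7];
    for layer in 0..2 {
        println!("layer {layer}: {:?}", bigram_embedding(&tokens, 2, layer, &table));
    }
}
```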
We support the most popular architectures
Our inference stack
Inference Engine
Rust-native
MPSGraph for ANE access without CoreML overhead
Targets the W8A8 regime, where Apple's neural accelerators peak
W8A8
peak ANE regime
2–4 bit
VQ-compressed weights
Quantisation Research
W8A8 → int8 storage → 2–4 bit vector quantisation
SpinQuant-style rotation makes W8A8 nearly lossless
ASTC codec investigation for zero-overhead dequant
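A toy sketch of the vector-quantisation step, under the assumed pipeline above: snap each group of four weights to its nearest codebook entry and store only the index. With a 256-entry codebook, that is one byte per group of four, i.e. 2 bits per weight.

```rust
// Assumed pipeline shape, not our Metal kernels: groups of four weights,
// a shared codebook, u8 indices.
fn nearest(code: &[[f32; 4]], v: &[f32; 4]) -> u8 {
    let mut best = 0usize;
    let mut best_d = f32::INFINITY;
    for (i, c) in code.iter().enumerate() {
        // Squared Euclidean distance between weight group and codeword.
        let d: f32 = c.iter().zip(v).map(|(a, b)| (a - b) * (a - b)).sum();
        if d < best_d {
            best_d = d;
            best = i;
        }
    }
    best as u8
}

/// Replace every 4-weight group with the index of its nearest codeword.
fn quantise(weights: &[[f32; 4]], code: &[[f32; 4]]) -> Vec<u8> {
    weights.iter().map(|v| nearest(code, v)).collect()
}

fn main() {
    // Toy codebook; in practice learned (e.g. k-means over weight groups)
    // after a SpinQuant-style rotation has smoothed outliers.
    let code = [
        [0.0, 0.0, 0.0, 0.0],
        [1.0, 1.0, 1.0, 1.0],
        [-1.0, -1.0, -1.0, -1.0],
    ];
    let weights = [[0.9, 1.1, 1.0, 0.8], [-0.2, 0.1, 0.0, -0.1]];
    println!("{:?}", quantise(&weights, &code)); // [1, 0]
}
```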
Arithmetic intensity target
I = FLOPs / bytes_DRAM
↑ block size → ↑ I during diffusion
↑ VQ ratio → ↓ bytes_DRAM
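A back-of-envelope instance for a single d × d linear layer, with toy numbers of our own choosing:

```rust
// Toy numbers, our own illustration. For B tokens decoded per step against a
// d x d weight matrix: FLOPs grow with B, weight bytes streamed from DRAM
// do not, so arithmetic intensity I rises with block size; VQ shrinks bytes.
fn intensity(block: f64, d: f64, bits_per_weight: f64) -> f64 {
    let flops = 2.0 * block * d * d;                // GEMM FLOPs for B tokens
    let bytes_dram = d * d * bits_per_weight / 8.0; // weight traffic
    flops / bytes_dram
}

fn main() {
    let d = 4096.0;
    for (block, bits) in [(1.0, 16.0), (8.0, 16.0), (8.0, 2.0)] {
        println!("B={block}, {bits}-bit weights: I = {} FLOPs/byte",
                 intensity(block, d, bits));
    }
    // B=1, 16-bit: I = 1   (memory-bound autoregressive decode)
    // B=8, 16-bit: I = 8   (bigger block raises I)
    // B=8,  2-bit: I = 64  (VQ cuts bytes_DRAM)
}
```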
Why on-device AI needs its own lab
Full-stack sovereignty.
0 architectural choices imposed by third-party frameworks. Tailoring the model to the hardware.
1.5B Apple Silicon users
The largest homogeneous compute substrate in history, underutilised for AI inference.
Privacy by architecture.
On-device inference means data never leaves the user's device.
Many innovative model architectures fail to gain adoption because inference stacks can't support them.
We exist to close that gap: own the model, own the stack, own the hardware abstractions.
Want to work on unsolved problems in on-device AI?
Open roles:
Machine Learning Engineer
Remote / SF / Europe • Full Time • Model Optimization
Machine Learning Engineer
Remote / SF / Europe • Full Time • Models & Research
Inference Engineer
Remote / SF / Europe • Full Time
Quantisation, speculative decoding, novel architectures. Small team, high ownership.
Intelligence for the edge.
Read our research
View our models
Speak with us