Remote / SF / Europe
Full Time
Machine Learning Engineer, Model Optimization
Join a small, senior team building the fastest on-device AI inference engine, powering real products, not demos.
About us:
At Mirai, we are building an open-source foundational layer for locally hosted, private and user-aligned AI.
Our Model Optimization team is focused on pushing the envelope of local LLM inference performance through clever algorithmic tricks and post-training model modifications.
The role:
We’re hiring a Machine Learning Engineer to work on model optimization pipelines in tight collaboration with our systems engineers.
This is a hybrid research and engineering role: we expect you to come up with new ideas, run experiments, and build production-grade pipelines.
All of the work will be released as open-source.
What you’ll do:
You will apply post-training modifications to compress and accelerate existing large language models. The goal is to push the Pareto frontier across latency, accuracy, and memory consumption for on-device inference engines.
The job involves analyzing the full inference stack, from hardware to software, to find and remove bottlenecks and increase utilization; implementing model compression algorithms; and training acceleration adapters.
What we’re looking for in a candidate:
Solid knowledge of fundamentals of Machine Learning, Linear Algebra and Statistics
Ability to write maintainable Python code
Familiarity with LLM inference and standard LLM optimization techniques
High-level understanding of GPU architecture and its programming model
Nice to have:
Systems engineering experience, especially GPU/NPU programming
Previous experience working with LLM inference
Familiarity with JAX
Some simple problems representative of what we deal with on a daily basis:
We have a K-FAC/Shampoo-style Kronecker-factorized approximation of the Fisher information matrix for a linear layer. We have quantized the weights and obtained a residual r = W - quant(W). Derive a closed-form expression for a LoRA adapter that compensates for the error introduced by quantization.
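For concreteness, one way to formalize the setup (this is our notation and one common K-FAC convention, not part of the problem statement): write the Kronecker-factored Fisher as F ≈ A ⊗ G, with A the input second-moment factor and G the output-gradient second-moment factor. A rank-k LoRA adapter L should then solve

    \min_{\mathrm{rank}(L) \le k} \; \mathrm{tr}\big( G \,(L - r)\, A \,(L - r)^{\top} \big), \qquad r = W - \mathrm{quant}(W),

and the exercise is to write the minimizer down in closed form.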
The fast Walsh-Hadamard transform is a Fourier-like orthogonal transformation that is often used to remove outliers from a vector. For large vectors it is more efficient to apply it block-wise instead of transforming the entire vector. Which block sizes would be the most convenient for a fast GPU implementation?
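For reference, a minimal NumPy sketch of the (unnormalized) transform and its block-wise application; it only illustrates the operation under discussion, and the block size below is an arbitrary placeholder rather than the answer to the question:

    import numpy as np

    def fwht(x):
        # Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two.
        y = np.asarray(x, dtype=np.float64).copy()
        n = y.shape[0]
        h = 1
        while h < n:
            for i in range(0, n, 2 * h):
                a = y[i:i + h].copy()
                b = y[i + h:i + 2 * h].copy()
                y[i:i + h] = a + b
                y[i + h:i + 2 * h] = a - b
            h *= 2
        return y

    def blockwise_fwht(x, block=128):
        # Apply the transform independently to contiguous length-`block` segments.
        assert x.shape[0] % block == 0
        return np.concatenate([fwht(chunk) for chunk in x.reshape(-1, block)])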
For small models with large vocabularies, the naive implementation of top-p sampling via logit sorting can be prohibitively slow. Can you design a more efficient and GPU-friendly sampling scheme?
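For context, a minimal NumPy sketch of the sort-based baseline the question refers to (function names are ours, purely for illustration); the full argsort over the vocabulary is the part that becomes the bottleneck:

    import numpy as np

    def top_p_sample_naive(logits, p=0.9, rng=None):
        # Baseline nucleus (top-p) sampling via a full sort of the vocabulary.
        rng = rng or np.random.default_rng()
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        order = np.argsort(-probs)                   # O(V log V) over the whole vocab
        cdf = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cdf, p)) + 1    # smallest prefix with mass >= p
        keep = order[:cutoff]
        kept = probs[keep] / probs[keep].sum()
        return int(rng.choice(keep, p=kept))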
Why join?
You’ll work on applied research that directly impacts how AI systems operate in real-world environments, not just benchmarks.
We are a fast-paced horizontal team with a lot of autonomy and trust.
We value technical depth and fast iteration. Competitive compensation + meaningful equity.
Why us?
Founded by proven entrepreneurs who built and scaled consumer AI leaders like Reface (200M+ users) and Prisma (100M+ users).
Our team is small (14 people), senior, and deeply technical. We ship fast and own problems end-to-end.
We’re advised by a former Apple Distinguished Engineer who worked on MLX, and backed by leading AI-focused funds and individuals.

Interested?
Join a small, senior team building the fastest on-device AI inference engine, powering real products, not demos.