Remote / SF / Europe
Full Time
Machine Learning Engineer, Model Optimization
Join a small, senior team building the fastest on-device AI inference engine, powering real products, not demos.
About us:
At Mirai, we are building an open-source foundational layer for locally hosted, private and user-aligned AI.
Our Model Optimization team is focused on pushing the envelope of local LLM inference performance through clever algorithmic tricks and post-training model modifications.
The role:
We’re hiring a Machine Learning Engineer to work on model optimization pipelines in tight collaboration with our systems engineers.
This is a hybrid research and engineering role: we expect you to come up with new ideas, run experiments, and build production-grade pipelines.
All of the work will be released as open-source.
What you’ll do:
You will apply post-training modifications to compress and accelerate existing large language models. The goal is to push the Pareto frontier across latency, accuracy, and memory consumption for on-device inference engines.
The job involves analyzing the full inference stack, from hardware to software, to find and remove bottlenecks and increase utilization; implementing model compression algorithms; and training acceleration adapters.
What we’re looking for in a candidate:
Solid knowledge of fundamentals of Machine Learning, Linear Algebra and Statistics
Ability to write maintainable Python code
Familiarity with LLM inference and standard LLM optimization techniques
High-level understanding of GPU architecture and its programming model
Nice to have:
Systems engineering experience, especially GPU/NPU programming
Previous experience working with LLM inference
Familiarity with JAX
Some simple problems representative of what we deal with on a daily basis:
We have a K-FAC/Shampoo-style Kronecker-factorized approximation of the Fisher information matrix for a linear layer. We have quantized the weights and obtained a residual r = W - quant(W). Derive a closed-form expression for a LoRA adapter that compensates for the error introduced by quantization.
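For concreteness, one way to formalize the setup (this is our notation and one common K-FAC convention, not part of the problem statement): write the Kronecker-factored Fisher as F ≈ A ⊗ G, with A the input second-moment factor and G the output-gradient second-moment factor. A rank-k LoRA adapter L should then solve

    \min_{\mathrm{rank}(L) \le k} \; \mathrm{tr}\big( G \,(L - r)\, A \,(L - r)^{\top} \big), \qquad r = W - \mathrm{quant}(W),

and the exercise is to write the minimizer down in closed form.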
The fast Walsh-Hadamard transform is a Fourier-like orthogonal transformation that is often used to remove outliers from a vector. For large vectors it is more efficient to apply it block-wise instead of transforming the entire vector. Which block sizes would be the most convenient for a fast GPU implementation?
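For reference, a minimal NumPy sketch of the (unnormalized) transform and its block-wise application; it only illustrates the operation under discussion, and the block size below is an arbitrary placeholder rather than the answer to the question:

    import numpy as np

    def fwht(x):
        # Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two.
        y = np.asarray(x, dtype=np.float64).copy()
        n = y.shape[0]
        h = 1
        while h < n:
            for i in range(0, n, 2 * h):
                a = y[i:i + h].copy()
                b = y[i + h:i + 2 * h].copy()
                y[i:i + h] = a + b
                y[i + h:i + 2 * h] = a - b
            h *= 2
        return y

    def blockwise_fwht(x, block=128):
        # Apply the transform independently to contiguous length-`block` segments.
        assert x.shape[0] % block == 0
        return np.concatenate([fwht(chunk) for chunk in x.reshape(-1, block)])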
For small models with large vocabularies, the naive implementation of top-p sampling via logit sorting can be prohibitively slow. Can you design a more efficient and GPU-friendly sampling scheme?
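For context, a minimal NumPy sketch of the sort-based baseline the question refers to (function names are ours, purely for illustration); the full argsort over the vocabulary is the part that becomes the bottleneck:

    import numpy as np

    def top_p_sample_naive(logits, p=0.9, rng=None):
        # Baseline nucleus (top-p) sampling via a full sort of the vocabulary.
        rng = rng or np.random.default_rng()
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        order = np.argsort(-probs)                   # O(V log V) over the whole vocab
        cdf = np.cumsum(probs[order])
        cutoff = int(np.searchsorted(cdf, p)) + 1    # smallest prefix with mass >= p
        keep = order[:cutoff]
        kept = probs[keep] / probs[keep].sum()
        return int(rng.choice(keep, p=kept))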
Why join?
You’ll work on applied research that directly impacts how AI systems operate in real-world environments, not just benchmarks.
We are a fast-paced horizontal team with a lot of autonomy and trust.
We value technical depth and fast iteration. Competitive compensation + meaningful equity.
Why us?
Founded by proven entrepreneurs who built and scaled consumer AI leaders like Reface (200M+ users) and Prisma (100M+ users).
Our team is small (14 people), senior, and deeply technical. We ship fast and own problems end-to-end.
We’re advised by a former Apple Distinguished Engineer who worked on MLX, and backed by leading AI-focused funds and individuals.

Interested?
Join a small, senior team building the fastest on-device AI inference engine, powering real products, not demos.