Inference Engineer

Join a small, senior team building the full on-device stack to achieve real-time local intelligence.

Remote / SF / Europe

Full Time

The role

We're looking for engineers who can help us build the software that makes modern LLMs run efficiently on-device.

You'll primarily work on uzu, our inference engine. The work includes:

  • Implementing new model architectures,

  • Optimizing kernels,

  • Supporting new modalities,

  • Adding new backends,

  • Building a wide range of features such as KV cache paging and continuous batching.

We are the frontier on-device AI lab.

We build the models, inference runtime, and quantization stack from the device constraint up, so AI can run at full capability on the hardware billions of people already own. Our stack spans from low-level GPU kernels to high-level model conversion tools. We're a small team obsessed with performance, working at the intersection of systems programming and machine learning research.

We encourage you to apply if you deeply understand at least one of:

  • How computers work.

  • How modern language models work.

and have experience with at least one of:

  • Writing high-performance GPU kernels.

  • Rust systems programming.

  • Implementing LLM architectures outside of high-level frameworks.

  • High-quality open source contributions.

We welcome applications from very talented students and early-career engineers.

Why us?

Founded by proven entrepreneurs who built and scaled consumer AI leaders like Reface (300M users) and Prisma (100M MAU).

Our team is small (16 people), senior, and deeply technical. We ship fast and own problems end-to-end.

We’re advised by a former Apple Distinguished Engineer who worked on MLX, and backed by leading AI-focused funds and individuals.
