
Introducing Mirai Quantization: Redefining the speed-quality frontier for local LLMs on Apple silicon.
By Artur Chakhvadze, Ryan Mathieu, Roman Knyazhitskiy, Nikolai Voinilenko, Chen-Chen Yeh, Artur Mullakhmetov, Eugene Bokhan, in collaboration with others at Mirai Labs.
Jun 4, 2026
We are excited to release our first quantized checkpoints for the Qwen3.5 series of models. Our co-designed quantization and inference-engine stack dramatically outperforms Unsloth/llama.cpp and MLX on the precision-vs-generation-speed tradeoff, delivering 40–60% more tokens per second at the same quality level across Apple devices ranging from smartphones to high-end desktops.
This is the first of many upcoming releases from our Model Optimization project, and a small part of a larger engineering effort to bring capable fully local agents running at over 1,000 tokens per second to consumer hardware. Many of the choices we made here are strategic: the quantization format was carefully chosen to maximize hardware utilization when used together with our future speculative decoding pipeline.
We are starting with 4-bit (Mirai-M) and 8-bit (Mirai-L) versions of Qwen3.5-0.8B, Qwen3.5-2B, and Qwen3.5-4B. Larger Qwen3.5/Qwen3.6 models, as well as the Gemma 4 family, are coming soon. Notably, on top-tier chips, our 8-bit models run faster than llama.cpp models quantized to 4-bit, while remaining virtually identical to the original unquantized bf16 versions.
Results
Our M checkpoints have size and memory requirements comparable to "3-bit" Unsloth/llama.cpp versions, while achieving much higher MMLU-Pro scores and much lower KL divergence, and outperforming all llama.cpp checkpoints by 30–50% in decoding speed across all devices.
Our L checkpoints are statistically indistinguishable from the full-precision versions, yet they achieve decoding speed comparable to 4- or 5-bit llama.cpp versions.
Format
We chose a conservative 4-bit asymmetric integer quantization format with 4-bit zero points and bf16 scales for our 4-bit checkpoints, and 8-bit symmetric quantization for the 8-bit versions. We use block-diagonal Random Hadamard Transforms with a block size of 32 as pre- and post-processing steps to suppress outliers.
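To make the two layouts concrete, here is a minimal pure-Python sketch of asymmetric 4-bit group quantization with a 4-bit zero point and a bf16-rounded scale, plus symmetric 8-bit quantization. The group size of 32, the helper names, and the simple min/max calibration rule are assumptions made for illustration. The released checkpoints are produced by the PTQ and distillation pipeline described under Algorithm, not by this rule.

```python
import struct

def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = (((bits + rounding) >> 16) << 16) & 0xFFFFFFFF
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def quantize_group_asym4(w):
    """Asymmetric 4-bit: w ~= scale * (q - zero), with q and zero in [0, 15]."""
    # Extend the range to include 0 so the zero point always fits in 4 bits.
    lo, hi = min(min(w), 0.0), max(max(w), 0.0)
    scale = to_bf16((hi - lo) / 15.0) or 1.0  # guard against an all-zero group
    zero = max(0, min(15, round(-lo / scale)))
    q = [max(0, min(15, round(x / scale) + zero)) for x in w]
    return q, scale, zero

def dequantize_group_asym4(q, scale, zero):
    return [scale * (qi - zero) for qi in q]

def quantize_group_sym8(w):
    """Symmetric 8-bit: w ~= scale * q, with q in [-127, 127] and no zero point.
    The storage type of the 8-bit scale is not stated in the post; a plain
    float is used here."""
    amax = max(abs(x) for x in w)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in w]
    return q, scale

if __name__ == "__main__":
    import random
    rng = random.Random(0)
    group = [rng.gauss(0.0, 1.0) for _ in range(32)]  # hypothetical group of 32 weights
    q, scale, zero = quantize_group_asym4(group)
    recon = dequantize_group_asym4(q, scale, zero)
    print("max abs error:", max(abs(a - b) for a, b in zip(group, recon)))
```

Under these assumptions, each group of 32 weights costs 32 × 4 bits of codes plus one 4-bit zero point and one 16-bit scale.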
The RHT block size was chosen to match the GPU warp width. This allows pre- and post-processing to be implemented using warp-level shuffle operations and fused into the prologues and epilogues of RMSNorm and GEMM/GEMV kernels, incurring zero additional memory traffic and only a minimal performance penalty.
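The following is a minimal pure-Python sketch of a block-diagonal Random Hadamard Transform with 32-wide blocks. It is a CPU reference for the math only, not the fused GPU kernel. The function names and the seeded sign vector are assumptions for illustration. Each butterfly stage pairs element `j` with element `j + h`, which within a 32-lane group is the lane whose index differs by XOR with `h`; this pairing is what makes a shuffle-based implementation natural.

```python
import math
import random

BLOCK = 32  # block size stated in the post

def fwht(v):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    v = list(v)
    n = len(v)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = v[j], v[j + h]
                v[j], v[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)
    return [x * s for x in v]

def make_signs(dim, seed=0):
    """Random +/-1 diagonal; the seed is a hypothetical choice."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]

def rht(x, signs):
    """Apply y = H D x independently to each block (len(x) must be a multiple of BLOCK)."""
    out = []
    for s in range(0, len(x), BLOCK):
        blk = [xi * si for xi, si in zip(x[s:s + BLOCK], signs[s:s + BLOCK])]
        out.extend(fwht(blk))
    return out

def rht_inverse(y, signs):
    """Invert rht: the normalized H is its own inverse and D squared is the identity."""
    out = []
    for s in range(0, len(y), BLOCK):
        blk = fwht(y[s:s + BLOCK])
        out.extend(b * si for b, si in zip(blk, signs[s:s + BLOCK]))
    return out

if __name__ == "__main__":
    signs = make_signs(64, seed=1)
    x = [0.1] * 64
    x[5] = 10.0  # a single outlier
    y = rht(x, signs)
    print("max |x| =", max(abs(v) for v in x), " max |y| =", max(abs(v) for v in y))
```

Because the transform is orthogonal, it preserves the vector norm while spreading an outlier's energy across the 32 entries of its block, which narrows the range a low-bit quantizer has to cover.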
The integer quantization format, while suboptimal from a rate-distortion perspective, can be implemented extremely efficiently on the GPU. When combined with dynamic 8-bit activation quantization, it can also use the hardware-accelerated int8 GEMM path for high-performance speculative decoding on Apple M5+ chips.
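As a rough illustration of why an integer weight format pairs well with int8 activations, the sketch below quantizes an activation vector dynamically and computes one group's dot product using only integer multiply-accumulates, applying the scales and the zero-point correction once at the end. The per-vector activation scale and all function names are assumptions; the post does not specify the activation quantization granularity.

```python
def quantize_activations_int8(x):
    """Dynamic symmetric int8 quantization: x ~= scale * q, with q in [-127, 127]."""
    amax = max(abs(v) for v in x)
    scale = amax / 127.0 if amax > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) for v in x], scale

def int8_group_dot(q_w, zero, scale_w, q_x, scale_x):
    """Dot product of one asymmetric-quantized weight group with int8 activations.

    sum_i scale_w * (q_w[i] - zero) * scale_x * q_x[i]
      = scale_w * scale_x * (sum_i q_w[i] * q_x[i] - zero * sum_i q_x[i])
    """
    acc = sum(qw * qx for qw, qx in zip(q_w, q_x))  # integer multiply-accumulate
    correction = zero * sum(q_x)                    # zero-point term, also integer
    return scale_w * scale_x * (acc - correction)

if __name__ == "__main__":
    q_w, zero, scale_w = [0, 15, 8, 3], 8, 0.5  # hypothetical 4-bit codes
    x = [1.0, -2.0, 0.5, 0.25]
    q_x, scale_x = quantize_activations_int8(x)
    print(int8_group_dot(q_w, zero, scale_w, q_x, scale_x))
```

The inner sum touches only small integers, which is the part a hardware int8 GEMM path can accelerate; the floating-point work is a single rescale per group.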
We experimented with advanced vector-quantizer designs and found that, while they can easily achieve much better rate-distortion characteristics, their GPU implementations suffer from shared-memory congestion and bank conflicts when reading from lookup tables. This, along with the desire to use the int8 GEMM path on neural accelerators, convinced us to stick with integer quantization.
For similar reasons, we avoid “intermediate” 3-, 5-, and 6-bit formats, despite the potential for considerable quality gains from fine-grained per-layer precision allocation.
Algorithm
We use a two-stage approach: Post-Training Quantization followed by a lightweight Quantization-Aware Distillation step for accuracy recovery.
For Post-Training Quantization, we use a custom JAX implementation of the YAQA second-order quantization algorithm, with minor modifications that improve numerical stability. We use an internal calibration dataset to estimate the Fisher Information Matrix.
We follow the PTQ step with a small quantization-aware distillation run on a dataset of rollouts generated by the unquantized teacher model. We use Muon combined with a deterministic straight-through estimator to optimize the quantized weights, and AdamW to optimize the group scales and zero points.
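The idea behind a deterministic straight-through estimator can be sketched in a few lines: the forward pass uses round-to-nearest quantized weights, while the backward pass treats the rounding as the identity and hands the gradient directly to the latent full-precision weights. The example below is a toy with a single linear unit, a squared-error loss against a teacher output, and plain SGD standing in for Muon. All names, the 4-bit signed grid, and the loss are assumptions for illustration, not the training recipe itself.

```python
def fake_quant(w_latent, scale, qmin=-8, qmax=7):
    """Deterministic round-to-nearest onto a uniform grid (no stochastic rounding)."""
    return [scale * max(qmin, min(qmax, round(v / scale))) for v in w_latent]

def forward(w_latent, x, scale):
    """Student output computed with the quantized weights."""
    w_q = fake_quant(w_latent, scale)
    return sum(a * b for a, b in zip(w_q, x))

def ste_grad(w_latent, x, teacher_out, scale):
    """Gradient of 0.5 * (student - teacher)^2 with respect to the latent weights.

    d(loss)/d(w_q[i]) = err * x[i]; the straight-through estimator copies this
    value to w_latent[i], as if the rounding step had derivative 1.
    """
    err = forward(w_latent, x, scale) - teacher_out
    return [err * xi for xi in x]

def sgd_step(w_latent, grad, lr):
    """Plain SGD update on the latent weights (a stand-in for Muon)."""
    return [w - lr * g for w, g in zip(w_latent, grad)]

if __name__ == "__main__":
    w, x, scale = [0.26, -0.4], [1.0, 2.0], 0.25
    teacher_out = 0.0  # hypothetical teacher output for this input
    for step in range(5):
        g = ste_grad(w, x, teacher_out, scale)
        w = sgd_step(w, g, lr=0.05)
        print(step, forward(w, x, scale))
```

A common refinement, not modeled here, is to zero the gradient for weights that were clipped at the edge of the grid.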
We found the quality of the distillation step to be highly sensitive to the quality and domain coverage of the dataset, and we are working on improving and expanding it. We expect significant further quality improvements as we continue to iterate on our QAD recipe.
Lalamo, our custom model-optimization framework, provides highly efficient abstractions that allow us to easily experiment with novel quantization formats and optimization algorithms, and to apply them to any LLM architecture out of the box.
Evaluation
The two metrics we focused on are MMLU-Pro score with thinking enabled and an 81,920-token reasoning budget, and the mean Kullback-Leibler divergence between the logits of the unquantized model and the quantized model.
We computed KL scores on a set of 100 questions randomly sampled from the Lmsys-Chat-1M dataset. Answers to those questions were generated with the unquantized model. We computed the mean KL divergence between teacher and student logits over the generated answers. The Lmsys-Chat-1M dataset was not used during either the PTQ or QAD stage.
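For reference, a per-token KL computation of this kind can be sketched as follows. The direction KL(teacher ‖ student) and the uniform averaging over token positions are assumptions; the post does not spell out either choice. The function names are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def kl_divergence(teacher_logits, student_logits):
    """KL(P_teacher || P_student) in nats for a single token position."""
    lp = log_softmax(teacher_logits)
    lq = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))

def mean_kl(teacher_seq, student_seq):
    """Average the per-position KL over all generated token positions."""
    kls = [kl_divergence(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(kls) / len(kls)

if __name__ == "__main__":
    teacher = [[0.0, 0.0], [2.0, 1.0, 0.0]]        # toy logits, two positions
    student = [[math.log(3.0), 0.0], [2.0, 1.0, 0.0]]
    print(mean_kl(teacher, student))
```

Both models are scored on the same teacher-generated token sequence, so the comparison is position by position and does not depend on what the quantized model would have sampled on its own.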
Notably, the scores for most 8-bit quantizations from any provider fall below the noise floor of these evaluations, so these evals cannot distinguish the quality of different 8-bit quantizations. We estimated the noise floor as the mean KL divergence between a llama.cpp full-precision model and our full-precision checkpoints. In this case, the only differences are due to numerical precision and reduction order, yielding a mean KL of approximately 8 × 10⁻⁴.
The issue with KL evaluation is that, while it is extremely precise, it is plausible for a good quantization to have a large KL. It also does not detect long-run performance degradation. Therefore, we also use MMLU-Pro evaluations, run with lm-evaluation-harness by EleutherAI, as an additional anchor.
We used a row-wise binomial parametric bootstrap to estimate a lower bound on the standard deviation of a single MMLU-Pro score. This lower bound is around 0.3 percentage points, making any differences below 1 percentage point statistically insignificant. This also explains how Unsloth’s 6-bit quantization can outperform its 8-bit quantization on MMLU-Pro.
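One way such an estimate could be computed is sketched below: each question (row) is resampled as an independent Bernoulli draw with its own probability of being answered correctly, and the spread of the resulting accuracies is measured. How the per-row probabilities are obtained is not described in the post, so `p_correct` is a hypothetical input, as are the function names. The last helper shows the arithmetic linking a per-score standard deviation to the smallest gap between two scores that clears a two-sided 95% threshold, assuming independent runs and a normal approximation.

```python
import math
import random
import statistics

def bootstrap_score_std(p_correct, n_boot=200, seed=0):
    """Parametric bootstrap: redraw every row as Bernoulli(p_i), return the
    standard deviation of the accuracy in percentage points."""
    rng = random.Random(seed)
    n = len(p_correct)
    scores = []
    for _ in range(n_boot):
        correct = sum(1 for p in p_correct if rng.random() < p)
        scores.append(100.0 * correct / n)
    return statistics.stdev(scores)

def analytic_score_std(p_correct):
    """Closed form for the same quantity: sqrt(sum p_i (1 - p_i)) / n, in points."""
    n = len(p_correct)
    return 100.0 * math.sqrt(sum(p * (1.0 - p) for p in p_correct)) / n

def min_detectable_gap(score_std, z=1.96):
    """Gap between two independent scores needed to reach z standard errors."""
    return z * math.sqrt(2.0) * score_std

if __name__ == "__main__":
    p_correct = [0.5] * 2500  # hypothetical per-question probabilities
    print(bootstrap_score_std(p_correct), analytic_score_std(p_correct))
    print(min_detectable_gap(0.3))
```

With a per-score standard deviation of 0.3 points, the difference of two independent scores has a standard deviation of about 0.42 points, so under these assumptions gaps smaller than roughly 0.8 points cannot be told apart from noise.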
Our 4-bit Qwen3.5-0.8B and Qwen3.5-2B checkpoints are comparable to Q4_K_M checkpoints in quality, while being about 20% smaller. The Qwen3.5-4B checkpoints are between Q3_K_M and Q4_0 in both size and quality. Our 8-bit checkpoints are indistinguishable in quality from MLX 8-bit and Unsloth UD-Q8_K_XL.
Try our quantized models
| Model size | Quantization | Decoding speed, M4 Pro | Resident memory |
|---|---|---|---|
| 4B | Mirai Large | ~48 tok/s | Under 4.75 GB |
| 4B | Mirai Medium | ~84 tok/s | Under 2.71 GB |
| 2B | Mirai Large | ~108 tok/s | Under 2.22 GB |
| 2B | Mirai Medium | ~171 tok/s | Under 1.32 GB |
| 0.8B | Mirai Large | ~222 tok/s | Under 1.01 GB |
| 0.8B | Mirai Medium | ~314 tok/s | Under 0.65 GB |
Future Work
In the coming weeks, we will release quantized checkpoints for larger Qwen3.5/Qwen3.6 models, including the 27B and 35B-A3B variants, as well as the Gemma 4 family of models.
Next, we will release our own take on diffusion-based speculative decoding, which will unlock the full potential of our inference architecture. Additionally, we are experimenting with novel approaches to 2- and 3-bit quantization, which would allow us to efficiently run 300B-parameter MoE models on high-end consumer hardware.