Qwen3-0.6B-MLX-4bit from Alibaba – Run On-Device with Mirai.

Qwen3-0.6B-MLX-4bit

Run locally on Apple devices with Mirai

Type: Local
From: Alibaba
Quantisation: uint4
Precision: float16
Size: 0.6B

Source

Qwen3-0.6B-MLX-4bit is a 600 million parameter language model from the Qwen3 series, quantized to 4 bits and optimized for inference on Apple Silicon with the MLX framework. It is a causal language model with 28 layers and 16 attention heads, and it supports a context length of 32,768 tokens.

The model can switch between a thinking mode for complex logical reasoning, mathematics, and coding tasks, and a non-thinking mode for efficient general-purpose dialogue. In thinking mode, it generates a reasoning chain wrapped in XML-style tags before producing the final response, similar to the larger QwQ-32B. Thinking can be enabled or disabled in two ways: a hard switch applied when the prompt is formatted, and soft switches that the user issues within a conversation using commands such as /think and /no_think.

Qwen3 brings improvements in reasoning, in human preference alignment for creative writing and role-playing, and in agent capabilities for tool integration. It supports over 100 languages with strong multilingual instruction following. The model is designed to work with the latest versions of the transformers and mlx_lm libraries and supports agentic use cases through tool calling.
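The thinking-mode output and the soft switches described above can be handled with a small amount of glue code. The following is a minimal sketch and not part of Mirai or mlx_lm: it assumes the XML-style tags are `<think>` and `</think>`, as in the upstream Qwen3 model card, and the helper names `split_thinking` and `with_soft_switch` are hypothetical. It uses only the Python standard library and works on a hand-written sample string rather than real model output.

```python
import re
from typing import Tuple

# Assumption: reasoning is wrapped in <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> Tuple[str, str]:
    """Split raw model output into (reasoning, final_answer).

    If no thinking block is present (non-thinking mode), the reasoning
    part is an empty string and the whole text is the answer.
    """
    match = THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the thinking block is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


def with_soft_switch(user_message: str, thinking: bool) -> str:
    """Append the /think or /no_think soft switch to a user turn."""
    switch = "/think" if thinking else "/no_think"
    return f"{user_message} {switch}"


# Hand-written sample, standing in for text generated by the model.
raw = "<think>\n2 + 2 is 4.\n</think>\n\nThe answer is 4."
reasoning, answer = split_thinking(raw)
print("reasoning:", reasoning)
print("answer:", answer)
print(with_soft_switch("What is 2 + 2?", thinking=False))
```

Separating the reasoning from the answer lets an application show or hide the reasoning chain independently, and the soft switch only changes the text of the user turn, so it can be applied per message within one conversation.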
