LFM2.5-1.2B-Thinking is a compact, 1.2-billion-parameter language model designed for on-device deployment. It builds on the LFM2 architecture with extended pre-training of up to 28 trillion tokens and large-scale reinforcement learning, achieving best-in-class performance for its size while rivaling much larger models. The model has a 32,768-token context length and supports eight languages: English, Arabic, Chinese, French, German, Japanese, Korean, and Spanish. It is built for fast edge inference with very low memory requirements, running in under 1 GB of memory and delivering 239 tokens per second on AMD CPUs and 82 tokens per second on mobile NPUs, with day-one support for inference frameworks including llama.cpp, MLX, and vLLM. LFM2.5-1.2B-Thinking uses a hybrid architecture that combines double-gated LIV convolution blocks with GQA blocks, making it particularly effective for agentic tasks, data extraction, and retrieval-augmented generation.
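As an illustration only, the sketch below shows how a model of this kind is typically loaded and prompted with Hugging Face Transformers; the repository id `LiquidAI/LFM2.5-1.2B-Thinking` and the example prompt are assumptions, not taken from the model card, so check the published checkpoint for the actual identifier and recommended generation settings.

```python
# Minimal sketch: chat-style inference with Hugging Face Transformers.
# The repo id below is an assumption; replace it with the published checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "LiquidAI/LFM2.5-1.2B-Thinking"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")

# Example data-extraction prompt (hypothetical).
messages = [
    {"role": "user", "content": "Extract the invoice number from: 'Invoice #4821, due March 3.'"},
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
outputs = model.generate(inputs, max_new_tokens=256)

# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```

The same checkpoint can also be served through llama.cpp, MLX, or vLLM for edge or server deployment; the exact conversion and launch steps depend on the framework and the quantization chosen.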