Part 1: Introduction to Deploying LLMs on Mobile

By

Artur Chakhvadze

Mar 22, 2025

Introduction

The emergence of Large Language Models (LLMs) as general-purpose language processing tools in recent years has created a surge in demand for AI features within consumer apps. While LLM-based chatbots and assistants are the most ubiquitous and widely discussed examples of AI systems, they represent only the tip of the iceberg. A significant portion of this demand comes from the need to extract structured data from text or to build "common sense" into application logic.

Previously, such processing required complex, purpose-built machine learning models and heuristic algorithms. Modern LLM-based solutions achieve much higher quality and are trivial to implement. Just five years ago, tasks like translating natural language prompts into database queries or converting handwritten formulas into LaTeX code required extensive research and development. Now, they can be implemented through a single request to an LLM service.
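
To make the "single request" point concrete, here is a minimal sketch in Swift of translating a natural-language question into a SQL query with one call to a hosted LLM. The endpoint URL, model name, and response shape follow a generic OpenAI-style chat API and are assumptions for illustration, not any specific provider's documented interface.

```swift
import Foundation

// Hypothetical single-request NL-to-SQL translation against an
// OpenAI-style chat endpoint (URL and model name are placeholders).
func naturalLanguageToSQL(_ question: String, apiKey: String) async throws -> String {
    var request = URLRequest(url: URL(string: "https://api.example.com/v1/chat/completions")!)
    request.httpMethod = "POST"
    request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")

    let body: [String: Any] = [
        "model": "small-instruct-model",   // placeholder model identifier
        "messages": [
            ["role": "system",
             "content": "Translate the user's request into a single SQL query. Reply with SQL only."],
            ["role": "user", "content": question]
        ]
    ]
    request.httpBody = try JSONSerialization.data(withJSONObject: body)

    let (data, _) = try await URLSession.shared.data(for: request)

    // Pull the first completion out of the response; a production client
    // would use Codable models and proper error handling instead.
    let json = try JSONSerialization.jsonObject(with: data) as? [String: Any]
    let choices = json?["choices"] as? [[String: Any]]
    let message = choices?.first?["message"] as? [String: Any]
    return message?["content"] as? String ?? ""
}
```

The interesting part is what is missing: no parsing grammar, no task-specific model, no training data. The entire "feature" is a prompt and a network call.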

We believe that in the near future, most interactions with LLMs will not involve humans at all.

Instead, AI will become a "binding layer", gluing together various APIs and software subsystems within applications. Unfortunately, cloud-based AI services are not well-suited for this role. While frontier cloud models offer unmatched intelligence, they are expensive and raise privacy concerns. Consider a messaging app that needs to extract calendar events from chats. Even ignoring privacy issues, the cost of running every message through a cloud model makes such a feature economically infeasible. However, these tasks do not require the intelligence of a frontier LLM and can be adequately handled by models small enough to run on a smartphone.
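
For a sense of scale, here is a hedged back-of-envelope calculation; every figure below is an illustrative assumption rather than a quoted price, but the conclusion holds across a wide range of realistic values.

```swift
// Back-of-envelope cost of extracting events from every message with a cloud LLM.
// All numbers are assumptions chosen for illustration, not provider pricing.
let dailyActiveUsers = 1_000_000.0
let messagesPerUserPerDay = 100.0     // assumed chat volume
let tokensPerRequest = 500.0          // assumed prompt + message + output
let usdPerMillionTokens = 1.0         // assumed blended price

let dailyCostUSD = dailyActiveUsers * messagesPerUserPerDay *
                   tokensPerRequest / 1_000_000 * usdPerMillionTokens
print(dailyCostUSD)                   // ≈ 50,000 USD per day under these assumptions
```

A feature that silently costs tens of thousands of dollars a day is hard to justify, while running the same extraction on-device costs nothing per request.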

Unfortunately, the current ecosystem of models and tools for running AI workflows on mobile devices significantly lags behind its cloud counterpart.

There are no reliable high-level LLM frameworks available for iOS and Android, and most mainstream inference engines are inefficient, poorly documented, and require specialized knowledge.

At Mirai, we are building an integrated set of tools that lets mobile developers add local AI to their apps with minimal effort. As part of this work, we are developing a custom inference engine optimized for mobile hardware to achieve maximum inference speed.

In this series of posts, we will discuss the hardware and software stacks of modern mobile platforms, covering the nuances of running modern LLMs on them efficiently. We will begin the series with a hardware overview of Apple's flagship device, the iPhone 16 Pro.

Try Mirai – AI that runs directly on your devices, bringing powerful capabilities closer to where decisions are made.

Hassle-free app integration, lightning-fast inference, and reliable structured outputs.