Part 4: Brief history of Apple ML Stack

By Mirai team

Mar 24, 2025

Apple's journey from simple math libraries to a full-blown AI powerhouse is quite the ride. Over the years, they've built an impressive ecosystem of hardware and software that runs machine learning right on your devices - prioritizing speed, battery life, and (of course) your privacy. Let's dive into how they pulled this off and what makes Apple's approach to AI unique in today's landscape.

The Early Days

It all started with the Accelerate framework - basically a collection of super-optimized math functions that leveraged CPU vector instructions to crunch numbers efficiently. This laid the groundwork for what would eventually become Apple's ML strategy.
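
To give a flavor of what that looks like in practice, here's a minimal Swift sketch using one of Accelerate's vDSP routines (the values are made up for illustration - the point is that the loop is vectorized for you):

```swift
import Accelerate

// vDSP_dotpr computes a dot product using the CPU's vector (SIMD) units,
// rather than a hand-rolled scalar loop.
let a: [Float] = [0.5, 1.0, -2.0, 3.0]
let b: [Float] = [1.0, 0.25, 0.5, 2.0]

var dot: Float = 0
vDSP_dotpr(a, 1, b, 1, &dot, vDSP_Length(a.count))
print(dot)  // 5.75
```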

Within Accelerate, Apple rolled out Basic Neural Network Subroutines (BNNS) - one of their first real attempts at dedicated ML tools. BNNS gave developers routines for running neural network inference efficiently on Apple's CPUs, with training support arriving in later releases. It covered the neural network essentials like convolutions, pooling, and fully connected layers, though it was pretty limited by today's standards.

Back then, everything ran on the CPU since Apple hadn't yet built specialized ML hardware. They were focused on getting traditional algorithms to run well on general-purpose processors - setting the stage for what was coming next.

Shifting to GPU Power

As neural networks got more complex and hungry for compute, Apple realized they needed to tap into the parallel processing power of GPUs. So they expanded Metal (originally a graphics API) into a computing platform for non-graphics tasks, including machine learning.

Metal's compute shaders gave developers GPU access for general calculations, which was perfect for accelerating the parallel operations that neural networks love. This was huge for convolutional neural networks (CNNs) that were taking over computer vision tasks, as they needed tons of parallel matrix operations that GPUs could handle much better than CPUs.
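
Here's a rough idea of what "compute shaders for general calculations" means in code - a hypothetical Swift sketch that runs an elementwise ReLU on the GPU through Metal. The kernel and buffer sizes are our own toy example, not Apple sample code:

```swift
import Metal

// A tiny Metal Shading Language kernel: clamp each float at zero (ReLU).
let source = """
#include <metal_stdlib>
using namespace metal;
kernel void relu(device float *x [[buffer(0)]],
                 uint id [[thread_position_in_grid]]) {
    x[id] = max(x[id], 0.0f);
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: source, options: nil)
let pipeline = try! device.makeComputePipelineState(function: library.makeFunction(name: "relu")!)

let data: [Float] = [-1.5, 0.25, -0.75, 3.0]
// .storageModeShared buffers live in unified memory, visible to CPU and GPU alike.
let buffer = device.makeBuffer(bytes: data,
                               length: data.count * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!

let queue = device.makeCommandQueue()!
let commandBuffer = queue.makeCommandBuffer()!
let encoder = commandBuffer.makeComputeCommandEncoder()!
encoder.setComputePipelineState(pipeline)
encoder.setBuffer(buffer, offset: 0, index: 0)
encoder.dispatchThreads(MTLSize(width: data.count, height: 1, depth: 1),
                        threadsPerThreadgroup: MTLSize(width: data.count, height: 1, depth: 1))
encoder.endEncoding()
commandBuffer.commit()
commandBuffer.waitUntilCompleted()

// Read the results back on the CPU, straight out of the same buffer.
let result = buffer.contents().bindMemory(to: Float.self, capacity: data.count)
print((0..<data.count).map { result[$0] })  // [0.0, 0.25, 0.0, 3.0]
```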

Apple's vertical integration strategy - controlling everything from chips to operating systems - was a major advantage here. Unlike competitors relying on third-party hardware and software, Apple could make system-level optimizations others couldn't match, especially once it started shipping its own custom GPU designs. For ML specifically, this meant tight integration between hardware, low-level software, and high-level frameworks.

Metal Performance Shaders

To squeeze even more ML performance from GPUs, Apple introduced Metal Performance Shaders (MPS) - a library of GPU-accelerated primitives optimized for image processing, linear algebra, and machine learning. MPS included pre-optimized kernels for common ML operations that delivered major performance gains over generic implementations.
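
To make that concrete, here's a hypothetical Swift sketch that multiplies two tiny matrices with MPSMatrixMultiplication, one of those pre-optimized kernels (sizes and values are illustrative only):

```swift
import Metal
import MetalPerformanceShaders

// Multiply a 2x3 matrix A by a 3x2 matrix B on the GPU with an MPS kernel.
let device = MTLCreateSystemDefaultDevice()!
let queue = device.makeCommandQueue()!

let a: [Float] = [1, 2, 3,
                  4, 5, 6]   // 2x3, row-major
let b: [Float] = [1, 0,
                  0, 1,
                  1, 1]      // 3x2, row-major

// Wrap a Swift array in an MTLBuffer and describe its matrix layout.
func matrix(_ values: [Float], rows: Int, columns: Int) -> MPSMatrix {
    let rowBytes = columns * MemoryLayout<Float>.stride
    let buffer = device.makeBuffer(bytes: values, length: rows * rowBytes,
                                   options: .storageModeShared)!
    let descriptor = MPSMatrixDescriptor(rows: rows, columns: columns,
                                         rowBytes: rowBytes, dataType: .float32)
    return MPSMatrix(buffer: buffer, descriptor: descriptor)
}

let A = matrix(a, rows: 2, columns: 3)
let B = matrix(b, rows: 3, columns: 2)
let C = matrix([Float](repeating: 0, count: 4), rows: 2, columns: 2)

let matMul = MPSMatrixMultiplication(device: device,
                                     transposeLeft: false, transposeRight: false,
                                     resultRows: 2, resultColumns: 2, interiorColumns: 3,
                                     alpha: 1.0, beta: 0.0)

let commandBuffer = queue.makeCommandBuffer()!
matMul.encode(commandBuffer: commandBuffer, leftMatrix: A, rightMatrix: B, resultMatrix: C)
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
// C now holds A x B: [[4, 5], [10, 11]]
```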

By 2017, developers were actively comparing BNNS (CPU) against MPS CNN (GPU) for neural network inference. These benchmarks helped figure out which tech worked best for different use cases. Generally, GPUs crushed it for larger models with lots of parallelism, while CPUs held their own for smaller networks where transferring data to the GPU wasn't worth the trouble.

MPS marked an important step in Apple's ML strategy by providing finely-tuned implementations that fully leveraged their GPU architecture. This let developers build more sophisticated ML features while keeping performance and battery life in check.

Core ML

In 2017, Apple introduced Core ML, a high-level framework designed to make integrating ML models into apps dead simple. Core ML handled all the complexity of model inference, automatically picking the best hardware (CPU, GPU, or later the Neural Engine) for optimal performance.
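
In app code, that looks roughly like the hypothetical Swift sketch below - load a compiled model, let Core ML pick the compute units, and run a prediction. The model name and input feature are invented for illustration:

```swift
import CoreML
import Foundation

// Let Core ML decide whether to run on CPU, GPU, or Neural Engine.
let config = MLModelConfiguration()
config.computeUnits = .all

// "Classifier.mlmodelc" is a hypothetical compiled model bundled with the app.
let url = Bundle.main.url(forResource: "Classifier", withExtension: "mlmodelc")!
let model = try! MLModel(contentsOf: url, configuration: config)

// Build an input feature (here a dummy 1x3x224x224 tensor named "image").
let imageArray = try! MLMultiArray(shape: [1, 3, 224, 224], dataType: .float32)
let input = try! MLDictionaryFeatureProvider(
    dictionary: ["image": MLFeatureValue(multiArray: imageArray)])

// Inference happens entirely on device.
let output = try! model.prediction(from: input)
print(output.featureNames)
```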

Core ML was built with on-device processing as a core principle - everything happens right on your device with no need to send data to servers. As Apple's docs put it:

"Core ML models run strictly on the user's device and remove any need for a network connection, keeping your app responsive and your users' data private." This matched Apple's privacy-first approach while also enabling ML features even without internet access.

The framework supported all kinds of models - neural networks, tree ensembles, SVMs, you name it. For developers, Core ML made it super easy to convert models trained in TensorFlow or PyTorch into an optimized format for Apple devices. This democratized ML, letting developers without specialized expertise add smart features to their apps.

Apple Neural Engine

Apple's ML strategy took a massive leap forward with the Apple Neural Engine (ANE) in the A11 Bionic chip, which powered the iPhone X in 2017. The ANE was a total paradigm shift - dedicated hardware specifically designed to accelerate neural networks with crazy efficiency.

The ANE is basically a Neural Processing Unit (NPU) that specializes in operations like convolutions and matrix multiplications - the bread and butter of modern neural networks. It's analogous to how GPUs accelerate graphics, but the ANE is laser-focused on neural network computations rather than general-purpose parallel workloads.

This specialized hardware delivered dramatic improvements in both speed and battery life. Models optimized for the ANE could run circles around the same models on CPU or GPU, while sipping power. This enabled way more sophisticated on-device ML without killing your battery - crucial for mobile devices.

As described in Apple's docs about running transformers on the Neural Engine:

"This architecture helps enable experiences such as panoptic segmentation in Camera with HyperDETR, on-device scene analysis in Photos, image captioning for accessibility, machine translation, and many others". The Neural Engine has gotten beefier with each generation of Apple silicon, appearing in iPhones, iPads, and now Macs.

But the ANE isn't perfect - it comes with limitations. "Not every Core ML model can make full use of the ANE". It works best with specific operations and architectures, so developers need to understand its quirks and sometimes redesign models to get the most out of it.
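
One practical way to probe those quirks is to load the same model under different compute-unit policies and compare latency. The snippet below is a hypothetical sketch of that idea using the generic MLModel API:

```swift
import CoreML
import Foundation

// Hypothetical check: load the same model with different compute-unit
// policies and compare inference latency to see whether the ANE helps.
func loadModel(at url: URL, units: MLComputeUnits) throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = units   // .cpuOnly, .cpuAndGPU, .cpuAndNeuralEngine, or .all
    return try MLModel(contentsOf: url, configuration: config)
}

// If the .cpuAndNeuralEngine variant isn't noticeably faster than .cpuOnly,
// the model likely contains ops that fall back off the ANE.
```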

Metal Performance Shaders Graph

As ML models got more complex, Apple extended their GPU capabilities with Metal Performance Shaders Graph (MPSGraph). This framework gave developers tools for building and running custom compute graphs, offering more flexibility and control than earlier approaches.

MPSGraph expanded Metal's compute capabilities to multi-dimensional tensors - the fundamental data structures in deep learning. It built on MPS's optimized primitives while adding support for sophisticated and dynamic neural network architectures. As one WWDC session put it, MPSGraph "extends Metal's Compute capabilities to multi-dimensional Tensors" and "builds on the highly tuned library of data parallel primitives that are vital to machine learning".
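
To give a flavor of the graph-building style, here's a small hypothetical Swift sketch that defines and runs a one-step graph (a matrix multiply followed by ReLU); shapes and values are illustrative:

```swift
import Foundation
import Metal
import MetalPerformanceShadersGraph

// Build a tiny compute graph: y = relu(x * W)
let graph = MPSGraph()
let x = graph.placeholder(shape: [1, 4], dataType: .float32, name: "x")
let w = graph.placeholder(shape: [4, 2], dataType: .float32, name: "W")
let y = graph.reLU(with: graph.matrixMultiplication(primary: x, secondary: w, name: nil),
                   name: nil)

// Wrap raw floats as MPSGraphTensorData so they can be fed into the graph.
let device = MPSGraphDevice(mtlDevice: MTLCreateSystemDefaultDevice()!)
func tensorData(_ values: [Float], shape: [NSNumber]) -> MPSGraphTensorData {
    let data = values.withUnsafeBufferPointer { Data(buffer: $0) }
    return MPSGraphTensorData(device: device, data: data, shape: shape, dataType: .float32)
}

let feeds = [x: tensorData([1, -2, 3, -4], shape: [1, 4]),
             w: tensorData([1, 0,
                            0, 1,
                            1, 0,
                            0, 1], shape: [4, 2])]

// Run the graph and read back the result for y.
let results = graph.run(feeds: feeds, targetTensors: [y], targetOperations: nil)
// results[y] holds relu([4, -6]) = [4, 0]
```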

This framework enabled advanced optimizations across entire model graphs rather than just individual operations - things like operator fusion, memory optimization, and automatic differentiation. MPSGraph also handled dynamic shapes, making it great for architectures like RNNs and transformers with variable-length inputs.

For developers who needed fine-grained control, MPSGraph provided an alternative to Core ML, especially valuable for apps with heavy graphics needs where coordinating ML tasks with other GPU operations could optimize overall performance.

Apple's ML Stack Today

Today, Apple's ML infrastructure is a comprehensive ecosystem of hardware and software working together to power on-device intelligence. The stack keeps evolving, with recent updates focused on optimizing for transformers and diffusion models that drive generative AI experiences.

Hardware Evolution

Modern Apple devices leverage three main hardware components for ML:

1. CPU: Still important for certain models and operations, especially those with limited parallelism

2. GPU: Offers massive computational throughput for parallel operations and remains crucial for graphics-heavy apps that include ML

3. Neural Engine: Provides specialized acceleration for neural networks, delivering the best performance-per-watt for compatible architectures

These components work together under Apple's unified memory architecture, which eliminates the costly data transfers you'd see in platforms where CPU and GPU memory are separate. This approach cuts latency and power consumption, giving Apple a significant edge for on-device ML.
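
A quick sketch of what unified memory buys you in practice: with Metal on Apple silicon, a single shared buffer is visible to both CPU and GPU, so there's no explicit upload or download step (illustrative snippet, not Apple sample code):

```swift
import Metal

// On Apple silicon, a .storageModeShared buffer lives in unified memory:
// the CPU writes it directly and the GPU reads the same bytes, no copies.
let device = MTLCreateSystemDefaultDevice()!
let values: [Float] = [1, 2, 3, 4]
let shared = device.makeBuffer(bytes: values,
                               length: values.count * MemoryLayout<Float>.stride,
                               options: .storageModeShared)!

// After any GPU work completes, the CPU can read results in place.
let ptr = shared.contents().bindMemory(to: Float.self, capacity: values.count)
print(ptr[0])  // 1.0
```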

Software Framework Integration

The software side now includes multiple frameworks for different needs:

1. Accelerate with BNNS: Recently updated with BNNSGraph, making BNNS "faster, more energy efficient, and far, far easier to work with" - particularly good for real-time ML inference on the CPU with guarantees like no runtime memory allocation.

2. Core ML: The main high-level framework for deploying models, now enhanced with "more granular and composable weight compression techniques" for LLMs and diffusion models. Recent updates added support for models with multiple functions, efficient state management, and a new MLTensor type.

3. Metal with MPS and MPSGraph: Continues to evolve with new features for transformer models, improved compute bandwidth, and cool visualization tools like the "all new MPSGraph viewer" - making it easier to optimize complex models for Apple's GPUs.

What's New

Apple keeps pushing on-device ML forward with innovations like:

1. Transformer Optimizations: In 2022, Apple released "an open-source reference PyTorch implementation of the Transformer architecture, giving developers worldwide a way to seamlessly deploy their state-of-the-art Transformer models on Apple devices". It's "specifically optimized for the Apple Neural Engine" to "minimize the impact of ML inference workloads on app memory, app responsiveness, and device battery life".

2. Core ML Tools Enhancements: Better tools for model compression and optimization, with more granular weight compression options for efficiently running large models on resource-constrained devices.

3. Apple Intelligence: Their newest AI strategy, built on foundation models that run both on-device and in Apple's Private Cloud Compute. Notably, Apple has emphasized that it builds on its own ML stack from the hardware up, without depending on NVIDIA hardware or the CUDA API.

The Big Picture

Apple's ML journey from basic math libraries to today's sophisticated ecosystem shows their commitment to on-device intelligence, with performance, battery life, and privacy as top priorities.

From Accelerate and BNNS to Metal for GPU computing, and eventually to special-purpose hardware like the Neural Engine, Apple has consistently pursued tight integration between hardware and software. This approach has enabled increasingly powerful ML capabilities while staying true to their core values around privacy and user experience.

As ML transforms computing, Apple's unique position - controlling both hardware and software - gives them distinct advantages in delivering efficient, private, and powerful AI experiences. Their ML stack evolution shows how an integrated approach to complex computing challenges can produce superior results.

Try Mirai – AI that runs directly on your devices, bringing powerful capabilities closer to where decisions are made.

Hassle-free app integration, lightning-fast inference, and reliable structured outputs.