Part 3: iPhone Hardware and How It Powers On-Device AI

By

Artur Chakhvadze

Mar 24, 2025

The iPhone 16 Pro's mainboard contains an Apple A18 Pro SoC stacked below an 8 GB Micron LPDDR5X SDRAM package running at 3750 MHz. The system-on-chip comprises three principal compute units: a (4+2)-core CPU, a 6-core GPU, and a 16-core Neural Engine. It uses a Unified Memory Architecture (UMA), meaning all three compute units share the same physical memory, providing a total memory bandwidth of 60 GB/s.

iPhone Main board, Image Source: iFixit
A18 Pro die scan, Image source: Chipwise

GPU

The A18 Pro GPU features six cores (five in the non-Pro A18), roughly analogous to Nvidia Streaming Multiprocessors. Each GPU core is equipped with a block of SRAM shared between threads and executes kernels in groups of up to 1024 threads, called threadgroups (thread blocks in CUDA terminology).

Threadgroups are organized into multiple SIMD-groups consisting of 32 parallel threads, conceptually analogous to CUDA warps. Each SIMD-group has its own fast SRAM layer, storing register values and thread-local variables. Threads within the same SIMD-group can communicate through collective communication primitives, bypassing slower threadgroup memory. Threads can synchronize using memory barriers or atomic operations.
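
To make the SIMD-group primitives concrete, here is a minimal sketch (our own example, not Apple sample code): a Metal kernel, embedded as a Swift source string for runtime compilation, that reduces 32 values per SIMD-group using the collective primitive simd_sum instead of going through threadgroup memory. The kernel and buffer names are ours.

```swift
import Metal

// Metal kernel embedded as a Swift string so the sketch stays self-contained;
// a shipping app would precompile it into a .metallib instead.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;

kernel void simdgroup_sum(device const float *input    [[buffer(0)]],
                          device float       *partials [[buffer(1)]],
                          uint gid  [[thread_position_in_grid]],
                          uint tg   [[threadgroup_position_in_grid]],
                          uint sg   [[simdgroup_index_in_threadgroup]],
                          uint sgs  [[simdgroups_per_threadgroup]],
                          uint lane [[thread_index_in_simdgroup]]) {
    // Each thread loads one element; simd_sum reduces it across the 32 lanes
    // of the SIMD-group without a round trip through threadgroup memory.
    float value = input[gid];
    float total = simd_sum(value);
    if (lane == 0) {
        partials[tg * sgs + sg] = total;   // one partial sum per SIMD-group
    }
}
"""

let device = MTLCreateSystemDefaultDevice()!
let library = try! device.makeLibrary(source: kernelSource, options: nil)
let pipeline = try! device.makeComputePipelineState(
    function: library.makeFunction(name: "simdgroup_sum")!)
```

Dispatching this over a 1D grid with any threadgroup size that is a multiple of 32 leaves one partial sum per SIMD-group; a second pass (or an atomic add) would combine them into a final result.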

Threads within a SIMD-group share the same instruction pointer and a 32-bit predicated execution mask. They execute instructions concurrently across multiple ALUs, writing results based on the execution mask. Branching control flow within a SIMD-group leads to divergence, causing each branch to execute sequentially, thus degrading performance proportionally to the number of branches.

Warp divergence, Image Source: Nvidia

Starting with the Volta architecture, Nvidia GPUs introduced independent thread scheduling, allowing interleaved execution of divergent branches within a warp and enabling communication between divergent threads. Unfortunately, Apple GPUs lack a similar mechanism: attempting to run collective communication instructions within divergent branches results in deadlocks.

Each thread performs scalar, vector, and matrix operations on integers (1, 8, 16, 32, and 64-bit) and on 16- and 32-bit floats, including bfloat16. Instructions for different data types can run in parallel across different ALUs, improving performance.

Following Nvidia’s introduction of Tensor Cores, Apple has introduced SIMD-group-level matrix data types, with matrix multiplication executed on systolic-array compute units. SIMD-groups collectively load 8x8 matrix fragments from threadgroup memory and perform synchronized multiply-accumulate operations without intermediate data transfers through registers.
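
As an illustration (a sketch with our own names, not Apple sample code), the following Metal kernel, again embedded as a Swift string, multiplies two 8x8 half-precision tiles using the SIMD-group matrix types; it assumes a 32-thread threadgroup so that a single SIMD-group owns the tiles.

```swift
// Sketch only: an 8x8 tile multiply using simdgroup matrix types. Buffers A, B,
// and C are assumed to each hold one row-major 8x8 half-precision tile.
let tileMatmulSource = """
#include <metal_stdlib>
using namespace metal;

kernel void tile_matmul(device const half *A [[buffer(0)]],
                        device const half *B [[buffer(1)]],
                        device half       *C [[buffer(2)]]) {
    // Fragments are owned collectively by the 32 threads of the SIMD-group.
    simdgroup_half8x8 a;
    simdgroup_half8x8 b;
    simdgroup_half8x8 acc = make_filled_simdgroup_matrix<half, 8, 8>(0.0h);

    simdgroup_load(a, A, 8);                       // collective load, leading dimension 8
    simdgroup_load(b, B, 8);
    simdgroup_multiply_accumulate(acc, a, b, acc); // acc += a * b on the matrix unit
    simdgroup_store(acc, C, 8);                    // collective store of the result tile
}
"""
```

A full GEMM would tile the operands, stage them in threadgroup memory, and loop this multiply-accumulate over the K dimension.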

Unfortunately, while Nvidia has added support for quantized 8- and 4-bit matrix data types in Ampere and Hopper GPUs, Apple only supports 16- and 32-bit floats. This makes it impossible to implement efficient quantized matrix multiplication GPU kernels, and the only benefit of model quantization on the GPU is reduced memory pressure.

We benchmarked GPU matrix multiplication using Apple MPSGraph kernels and measured a peak performance of 1800 GFLOP/s. Surprisingly, the results do not depend on the data type, and there is no benefit to using lower-precision floats in tasks with high arithmetic intensity. This result, together with the limited support for quantized matrix multiplication, suggests Apple does not consider the GPU a primary compute platform for machine learning workloads, prioritizing the Neural Engine instead.
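
For reference, a benchmark of this kind can be expressed in a few lines of MPSGraph. The sketch below is not our exact benchmark code; the size, names, and zero-filled inputs are ours. It builds and runs a single float16 matmul.

```swift
import Foundation
import Metal
import MetalPerformanceShadersGraph

// Minimal MPSGraph float16 matmul; a real benchmark would feed random data
// and time repeated run calls, discarding the first (compilation) iteration.
let n = 1024                                           // hypothetical M = N = K
let shape: [NSNumber] = [NSNumber(value: n), NSNumber(value: n)]

let graph = MPSGraph()
let a = graph.placeholder(shape: shape, dataType: .float16, name: "A")
let b = graph.placeholder(shape: shape, dataType: .float16, name: "B")
let c = graph.matrixMultiplication(primary: a, secondary: b, name: "C")

let device = MPSGraphDevice(mtlDevice: MTLCreateSystemDefaultDevice()!)
let zeros = Data(count: n * n * MemoryLayout<Float16>.size)
let aData = MPSGraphTensorData(device: device, data: zeros, shape: shape, dataType: .float16)
let bData = MPSGraphTensorData(device: device, data: zeros, shape: shape, dataType: .float16)

let results = graph.run(feeds: [a: aData, b: bData], targetTensors: [c], targetOperations: nil)
```

Throughput is then estimated by dividing 2·M·N·K floating-point operations by the measured wall-clock time per run.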

GPU Benchmark results, Matmul (M = N = K), Source: Mirai

Neural Engine

Apple is notably reluctant to share detailed information about the Neural Engine architecture, forcing developers to infer details from limited software documentation, microbenchmarks, analysis of similar chips, and reverse engineering.

Typically, AI accelerators such as Google TPU, Huawei DaVinci, and Qualcomm Hexagon follow similar design principles. They consist of identical cores built around large matrix multiplication systolic arrays, supplemented by scalar and vector coprocessors and SRAM blocks. It is reasonable to assume Apple's Neural Engine follows a similar approach.

Huawei Da Vinci, Image Source: Huawei
Qualcomm Hexagon, Source: Hexagon
Google TPU, Source: Google

While developers can write custom GPU kernels using the Metal programming language, the only direct way to execute programs on the Neural Engine is via the Metal Performance Shaders Graph API. It is a TensorFlow-like graph execution framework supporting CPU, GPU, and Neural Engine backends. Under the hood, MPSGraph executable objects are compiled into an MLIR representation, which is then lowered into a set of kernels suitable for execution on the GPU or Neural Engine backends. Subgraphs supported by the Neural Engine are represented as ANERegionCall operations and are scheduled for execution by a separate system process. This means that any data that is not part of the compiled subgraph has to be passed across the inter-process boundary using the IOSurface mechanism, which may introduce significant overhead. This presents challenges when optimizing machine learning models for execution on the Neural Engine, since any operation not supported by the Neural Engine will require expensive inter-process data transfers.
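
To make the workflow concrete, the sketch below (graph, shapes, and names are ours) shows the explicit compile step that produces an MPSGraphExecutable. As discussed above, which subgraphs end up as ANERegionCall regions is decided by the compiler during this step rather than by the caller.

```swift
import Metal
import MetalPerformanceShadersGraph

// Compiling a tiny graph into an executable; Neural Engine placement of
// eligible subgraphs happens internally here and is not exposed to the caller.
let graph = MPSGraph()
let xShape: [NSNumber] = [1, 512]
let wShape: [NSNumber] = [512, 512]
let x = graph.placeholder(shape: xShape, dataType: .float16, name: "x")
let w = graph.placeholder(shape: wShape, dataType: .float16, name: "w")
let y = graph.matrixMultiplication(primary: x, secondary: w, name: "y")

let device = MPSGraphDevice(mtlDevice: MTLCreateSystemDefaultDevice()!)
let executable = graph.compile(with: device,
                               feeds: [x: MPSGraphShapedType(shape: xShape, dataType: .float16),
                                       w: MPSGraphShapedType(shape: wShape, dataType: .float16)],
                               targetTensors: [y],
                               targetOperations: nil,
                               compilationDescriptor: nil)
```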

Apple doesn't publish any documentation specifying the rules that govern the operation placement decisions made by the MPSGraph compiler. We found some pieces of documentation to be conflicting, and many MPSGraph and CoreML API reference entries to be incomplete or outright wrong. The following findings are results of empirical experiments, and may contain errors.

The A18 Pro Neural Engine primarily performs int8 x int8 and float16 x float16 matrix multiplications, lacking support for float32 or bfloat16 data types. Inputs can optionally use int4 or int8 linear quantization, with int8 supporting groupwise quantization and int4, surprisingly, limited to channel-wise quantization.

The limited support for data types and quantization schemes makes adapting modern neural net architectures for the Neural Engine quite challenging. For example, to avoid overflows, RMSNorm layers represent activation statistics for each token as 32-bit floating-point numbers. Since 32-bit floats are not supported by the ANE, a direct implementation of RMSNorm in MPSGraph has to move activations from the ANE to the GPU and back, adding significant latency in the process. Moreover, efficient execution of operations on the Neural Engine requires "baking" model weights into the graph as constants. This means that on-device model training or online adaptation can only be efficiently performed on the GPU.
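
The contrast between baked and runtime weights can be sketched as follows (a hypothetical example with our own names and stand-in data): the constant path captures the weight values inside the graph, while the placeholder path feeds them at run time and, per the observation above, is only suitable for GPU execution.

```swift
import Foundation
import MetalPerformanceShadersGraph

let graph = MPSGraph()
let x = graph.placeholder(shape: [1, 512], dataType: .float16, name: "x")

// ANE-friendly path: weight values are captured ("baked") into the graph as a constant.
let weightBytes = Data(count: 512 * 512 * MemoryLayout<Float16>.size)   // stand-in weight data
let wBaked = graph.constant(weightBytes, shape: [512, 512], dataType: .float16)
let yBaked = graph.matrixMultiplication(primary: x, secondary: wBaked, name: "y_baked")

// Updatable path: weights arrive through a placeholder at run time, so this matmul
// stays on the GPU, e.g. for on-device training or online adaptation.
let wLive = graph.placeholder(shape: [512, 512], dataType: .float16, name: "w_live")
let yLive = graph.matrixMultiplication(primary: x, secondary: wLive, name: "y_live")
```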

In our matrix multiplication benchmarks, we measured a peak performance of 27000 GFLOP/s when using int4 per-channel quantized weights. This result falls short of the 35 TOPS claimed by Apple. We were also surprised to learn that the choice of weight quantization scheme has such a significant impact on the overall performance of the kernel.

Neural Engine Benchmark results, Matmul, Source: Mirai

Next article: Part 4, A Brief History of the Apple ML Stack.

Helpful links

iFixit Chip ID:

https://www.ifixit.com/Guide/iPhone+16+Pro+Chip+ID/177358

MIL:

https://apple.github.io/coremltools/source/coremltools.converters.mil.mil.ops.defs.html

MPSGraph:

https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph

https://developer.apple.com/videos/play/wwdc2024/10218/

https://developer.apple.com/videos/play/wwdc2023/10050/

https://developer.apple.com/videos/play/wwdc2022/10063/

https://developer.apple.com/videos/play/wwdc2021/10152/

https://developer.apple.com/videos/play/wwdc2020/10677/

ANE Reverse Engineering:

https://eclecticlight.co/2022/03/30/the-hunt-for-the-m1s-neural-engine/

https://eclecticlight.co/2022/03/29/live-text-visual-look-up-face-recognition-ml-and-privacy/

https://github.com/tinygrad/tinygrad/tree/d0e752003da3fc023fa85094d7f5b65b47dd5091/extra/accel/ane

https://github.com/eiln/ane

https://www.youtube.com/watch?v=1wvBDUnPNEo

ANE Patents:

https://patentimages.storage.googleapis.com/09/94/b0/33e4247e137a73/US20220237438A1.pdf

https://patentimages.storage.googleapis.com/a4/83/a8/ad9d221cb7f8d8/US20190340498A1.pdf

Metal:

https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf

https://developer.apple.com/videos/play/wwdc2022/10066/

https://developer.apple.com/videos/play/tech-talks/10580/

Nvidia:

https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.1.pdf

https://docs.nvidia.com/cuda/cuda-c-programming-guide/#simt-architecture

https://docs.nvidia.com/cuda/parallel-thread-execution

https://resources.nvidia.com/en-us-tensor-core

https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/

Hexagon:

https://chipsandcheese.com/p/qualcomms-hexagon-dsp-and-now-npu

TPU:

https://arxiv.org/abs/1704.04760

https://jax-ml.github.io/scaling-book/

Systolic Arrays:

https://www.eecs.harvard.edu/htk/static/files/1978-cmu-cs-report-kung-leiserson.pdf

https://safari.ethz.ch/digitaltechnik/spring2018/lib/exe/fetch.php?media=onur-digitaldesign-2018-lecture23a-systolic-arrays-and-beyond-afterlecture.pdf

https://www.youtube.com/watch?si=TTIqAapld1l-YPzw&v=QOi6ctI4W-8&feature=youtu.be

Try Mirai – AI which runs directly on your devices, bringing powerful capabilities closer to where decisions are made.

Hassle-free app integration, lightning-fast inference, reliable structured outputs