On June 12, 2025, the world was reminded how fragile our AI infrastructure really is. When GCP and Cloudflare both went down for several hours, they took half the internet with them, including most of the big AI providers - Google's Gemini and Anthropic's Claude among them.

This cascade of failures rendered AI apps like Cursor, Gemini, and Claude essentially unusable. For millions of users, their AI assistants simply vanished. To be fair to the providers, outages at this scale are quite rare; the last major one was in 2021, when Fastly brought down Amazon, Reddit, Spotify and others for about 49 minutes. Still, it's a reminder that nearly all AI applications today are heavily dependent on the cloud and a stable internet connection.
We Have the Hardware
Sam Altman recently mentioned that a single ChatGPT query uses roughly 0.34 Wh of energy - about the same as running an oven for one second. That might not sound like much, but imagine if every time you asked your phone a question, you were firing up your oven. That's where Neural Processing Units (NPUs) come in. These specialized chips are designed to run AI inference (the process of actually using a trained model, as opposed to training it) at 3-8x less power than traditional processors.
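As a quick back-of-the-envelope sanity check on that comparison (the oven wattage below is my assumption - household ovens typically draw on the order of 1-2 kW - not a figure from Altman):

```python
# Back-of-the-envelope check of the "oven for one second" comparison.
# Assumption (not from the original claim): a household oven draws ~1.2 kW.
PROMPT_WH = 0.34            # reported energy per ChatGPT prompt, in watt-hours
OVEN_WATTS = 1200           # assumed oven power draw

prompt_joules = PROMPT_WH * 3600          # 1 Wh = 3600 J -> ~1224 J
oven_seconds = prompt_joules / OVEN_WATTS

print(f"{prompt_joules:.0f} J per prompt ≈ {oven_seconds:.1f} s of oven time")
# -> 1224 J per prompt ≈ 1.0 s of oven time
```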

“NPU” has become the catch-all label for on-chip AI horsepower, but every vendor slaps its own badge on the same idea. Apple ships a Neural Engine (ANE), Qualcomm leans on Hexagon, MediaTek calls it an APU, Arm brands Ethos, Intel waves AI Boost, AMD talks up XDNA, Google rolls out TPU in the cloud and Edge TPU on devices, Graphcore touts an IPU, Horizon Robotics sells a BPU, and the list keeps growing. Under the stickers? All of them are matrix engines built to chew through neural-net math - some even stretch to full-blown training.
This echoes the early days of computing when architectural incompatibility was the norm. IBM System/360 code (1964) wouldn't boot on a DEC PDP-11 (1970), and today an ANE binary won't magically light up on a rival NPU. Different decade, same fragmentation - just with tensors instead of punched cards. Throughout the 1970s minicomputer revolution, incompatible architectures forced developers to choose between competing platforms, fragmenting the software ecosystem.
The industry is attempting to solve this NPU fragmentation through abstraction layers, much like how graphics APIs eventually unified GPU programming. Google's Android Neural Networks API (NNAPI) tried to abstract NPU access across Android devices, though it was deprecated as an NDK API in Android 15, with developers now migrating to TensorFlow Lite or hardware-specific SDKs. Apple's Core ML provides a unified interface for their Neural Engine. ONNX promises model portability across platforms. But unlike graphics APIs that had decades to mature, or CUDA that could dominate through sheer market force, NPU standards are fragmenting faster than they're converging. Cross-platform solutions like ONNX or Foundry Local exist, but integrating them into native applications often reveals how stale they are, with supported models lagging months behind the state of the art.
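To make the abstraction-layer promise concrete, here's a minimal sketch of the "write once, pick whatever accelerator is present" pattern using ONNX Runtime's execution providers. The model path is a placeholder, and which providers actually show up depends entirely on your onnxruntime build and hardware:

```python
# Minimal sketch: ONNX Runtime picks an "execution provider" (Core ML, QNN,
# DirectML, ...) and falls back to CPU when the accelerator isn't there.
# "model.onnx" is a placeholder path, not a real artifact from this post.
import onnxruntime as ort

preferred = [
    "CoreMLExecutionProvider",   # Apple Neural Engine / GPU
    "QNNExecutionProvider",      # Qualcomm Hexagon NPU
    "DmlExecutionProvider",      # Windows DirectML
    "CPUExecutionProvider",      # universal fallback
]

available = ort.get_available_providers()
providers = [p for p in preferred if p in available] or ["CPUExecutionProvider"]

session = ort.InferenceSession("model.onnx", providers=providers)
print("Running on:", session.get_providers()[0])
```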

But We Can't Use It
The fragmentation is real, and it's holding everyone back. Apple's recent Foundation Models framework shows what's possible with the right infrastructure - what used to take us two weeks to implement (getting ASR and local LLMs running on a Mac) now takes less than a day, especially with their new SpeechAnalyzer and SpeechTranscriber APIs that offer significantly better ASR models than their previous speech recognition offerings. But that's only on Apple's tightly controlled ecosystem, and users have to be running OS 26 or later. It was also mentioned that model updates may ship with OS updates - if the foundation models are only updated once a year, that's worrying, but if updates come too frequently, keeping up becomes a maintenance pain as well.
Microsoft can't seem to make up its mind about how to brand its toolchain either. From Windows AI to Windows App SDK to Windows Copilot Runtime to Windows AI Foundry (the latest rebrand), the frequent renaming itself illustrates the confusion and uncertainty in the space. Even Microsoft's Phi Silica model, which is supposed to ship with Windows 11, is still in developer preview after eight months, and you can only access it on Windows devices running Qualcomm chips.
Microsoft has the harder challenge of supporting multiple chip makers, and partly because of that it's still behind Apple. Foundry Local supports Qualcomm NPUs with phi-4-mini-reasoning (Apr 2025) and deepseek-r1 (Jan 2025), but on other chips inference still falls back to the CPU/GPU. Note that phi-4-mini-reasoning is a different model from the standard phi-4-mini, and developers report various implementation challenges even with the officially supported versions.
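Whatever silicon the model ultimately lands on, the app-facing pattern Foundry Local is going for is a locally served, OpenAI-compatible endpoint. Here's a rough sketch of consuming one - the port and model alias are assumptions for illustration, not values taken from this post or from Microsoft's docs:

```python
# Sketch of talking to a locally served model through an OpenAI-compatible
# endpoint, the pattern Foundry Local exposes. The base_url port and model
# alias below are assumptions; check your local service for the real values.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5273/v1",  # assumed local endpoint
    api_key="not-needed-locally",         # local services typically ignore this
)

resp = client.chat.completions.create(
    model="phi-4-mini",  # assumed alias of a locally downloaded model
    messages=[{"role": "user", "content": "Why does on-device inference matter?"}],
)
print(resp.choices[0].message.content)
```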

Then you have to worry about state-of-the-art models changing every couple of months and working with chip makers to ensure that minor changes in training translate properly when running inference on each platform.
The conversion pipeline itself is a nightmare. Most models start life as PyTorch checkpoints trained on NVIDIA GPUs. Getting them to run on edge devices means navigating a maze of format conversions: PyTorch → Core ML for Apple, or PyTorch → TensorFlow Lite → Android NNAPI, or PyTorch → DirectML for Windows. Each conversion step introduces potential precision loss, unsupported operations, and performance degradation. What worked perfectly in your training environment might completely fail or run 10x slower on the target device.
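As one concrete leg of that maze, here's roughly what the PyTorch → Core ML step looks like with coremltools, sketched on a toy model. Real LLM and ASR checkpoints are exactly where the unsupported ops and precision surprises show up, so treat this as the happy path, not a recipe:

```python
# Sketch of the PyTorch -> Core ML leg of the conversion maze using coremltools.
# A toy model is used here; real checkpoints hit unsupported ops, precision
# loss, and performance cliffs that this tiny example never will.
import torch
import coremltools as ct

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

example_input = torch.rand(1, 128)
traced = torch.jit.trace(model, example_input)  # conversion wants a traced graph

mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(shape=example_input.shape)],
    convert_to="mlprogram",
    compute_units=ct.ComputeUnit.ALL,  # let Core ML schedule across CPU/GPU/ANE
)
mlmodel.save("model.mlpackage")
```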
Most machine learning engineers operate in a Python-rich world. Models are trained and tuned in Python - but application developers work in their native language. Folks building solutions for endpoint devices need to distribute their apps in the platform's native format. That means additional support for Swift, .NET, TypeScript, Rust, C++, Go, Kotlin and many other languages that developers may choose from. The SDK fragmentation alone is enough to make most developers stick with cloud APIs. And it's not just about converting LLM or ASR models either: once users get a taste of AI-rich features, they want more.
And Users Want Everything
You can offer a privacy-first, cost-efficient, offline experience to the end user, but it needs to be nearly as valuable as the cloud-first alternatives, or most of your users will not stick around. Only those who deeply value privacy or have strict compliance requirements will make the trade-off.
Once you start building AI-native applications, you will run into requests that require speaker diarization, vision capabilities, OCR, text-to-speech and the like - capabilities that have traditionally run in the cloud on GPUs. Each of these requires its own model, often with a different architecture. The fragmentation problem compounds exponentially. It's hard enough getting a single LLM to run efficiently across different chips. Now multiply that by every AI capability users expect.
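To put rough numbers on how that multiplies, here's a purely hypothetical inventory for a single AI-native app - every model and format below is illustrative, not a claim about any particular product:

```python
# Hypothetical inventory for one app: each capability tends to mean another
# model family, another format, and another per-chip conversion target.
# All entries are illustrative, not recommendations or real dependencies.
CAPABILITIES = {
    "chat / summarization": {"model": "small LLM",            "format": "GGUF / Core ML / ONNX"},
    "speech-to-text":       {"model": "ASR encoder",          "format": "Core ML / ONNX"},
    "speaker diarization":  {"model": "embeddings + cluster", "format": "ONNX"},
    "OCR":                  {"model": "vision transformer",   "format": "TFLite / Core ML"},
    "text-to-speech":       {"model": "neural vocoder",       "format": "ONNX / vendor SDK"},
}

TARGET_CHIPS = ["Apple ANE", "Qualcomm Hexagon", "Intel AI Boost", "AMD XDNA"]

# One conversion-and-validation pipeline per (capability, chip) pair:
print(len(CAPABILITIES) * len(TARGET_CHIPS), "pipelines to build and keep green")  # -> 20
```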
The Path to Intelligence Everywhere
Despite these challenges, the momentum is undeniable. Small language models are getting remarkably good, good enough for many everyday tasks. The power efficiency gains are real. And the demand for offline, private, always-available AI is only growing.
We're at an inflection point. The hardware is here. The models are quite capable already. What we need now is the connective tissue: the standards, frameworks, and tools that will make edge AI as seamless as cloud AI has become. Just as application code never converged on a single CPU architecture, we're unlikely to see the world converge on one NPU architecture. But the field needs to narrow down to 2-3 targets, not one per chip maker. Even then, the underlying hardware should ideally be abstracted away from the application developer through a platform-agnostic layer, much like Docker or the JVM made deploying application code simple.
The underlying building blocks for models on NPUs and techniques for conversion should not be the moat. They need to be open-sourced and shared with the community to let everyone accelerate the timeline. We all win if more inference happens on-device.
By 2030, we'll look back at today's cloud-dependent AI the same way we look at dial-up internet. Quaint, necessary for its time, but ultimately a stepping stone to something far more powerful and pervasive. Software took off the moment code became pure logic - portable bytes that ignored the underlying silicon. When AI models reach that same level of portability and cost-efficiency, intelligence will move everywhere we are.