Thomas Kurian, Google Cloud's CEO, walked on stage at Google Cloud Next in Las Vegas on April 22, 2026, and announced that the experimental phase of artificial intelligence is officially over. What made that moment stick was not the rhetoric. It was the hardware. Google had just split the TPU, its flagship accelerator chip, into two permanently distinct architectures for the first time in a decade. One for training, one for serving. The TPU 8t and the TPU 8i. Two chips. One generation. A deliberate, structural bet that the era of compromise in AI infrastructure has ended.

For years, Google shipped unified TPUs that did both jobs reasonably well and neither job optimally. That was a feature when AI workloads were experimental and customers needed flexibility. It becomes a liability the moment most of your customers are running production inference at scale. Kurian noted that nearly 75% of Google Cloud customers are already using AI in production. The market signaled what Google needed to hear: the bottleneck is no longer raw training power. It is serving models fast and cheap. Gartner analyst Chirag Dekate said it plainly: the battleground is shifting towards inference. Google responded by building two different weapons.

The TPU 8t is the training specialist. A single superpod of TPU 8t chips scales to 9,600 processors and 2 petabytes of shared high-bandwidth memory, delivering 121 ExaFlops of compute, 2.8 times the performance of its predecessor, the Ironwood TPU, at the same price. What matters more than raw numbers is the architecture Google built around it. Google claims that the new Virgo Network, combined with JAX and Pathways software, enables near-linear scaling for up to one million chips in a single logical cluster. If that claim holds, development cycles for frontier models drop from months to weeks. The infrastructure exists, in theory, to train a model the size of GPT-5 or its successors without the communication overhead that currently limits cluster efficiency. Google DeepMind co-designed the TPU 8t with Broadcom. Google is using its own Axion ARM-based CPUs to host it. This is vertical integration at the infrastructure layer.
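The superpod figures above imply some per-chip numbers that the announcement does not spell out. The sketch below derives them by simple division. It assumes decimal units (1 ExaFlop = 1,000 PetaFlops, 1 petabyte = 1,000,000 gigabytes) and an even split across chips. The numeric precision behind the 121 ExaFlops figure is not stated in the text, so the results are rough orientation, not a spec sheet. The function names are illustrative only.

```python
# Figures quoted in the article for one TPU 8t superpod.
CHIPS = 9_600
POD_EXAFLOPS = 121.0
POD_HBM_PETABYTES = 2.0
SPEEDUP_VS_IRONWOOD = 2.8


def per_chip_petaflops(pod_exaflops: float, chips: int) -> float:
    """Average compute per chip, assuming an even split (1 EF = 1,000 PF)."""
    return pod_exaflops * 1_000 / chips


def per_chip_hbm_gb(pod_petabytes: float, chips: int) -> float:
    """Average HBM per chip in decimal gigabytes (1 PB = 1,000,000 GB)."""
    return pod_petabytes * 1_000_000 / chips


def implied_predecessor_exaflops(pod_exaflops: float, speedup: float) -> float:
    """Compute the same spend would have bought on the previous generation."""
    return pod_exaflops / speedup


if __name__ == "__main__":
    print(f"PFLOPs per chip:  {per_chip_petaflops(POD_EXAFLOPS, CHIPS):.1f}")
    print(f"HBM per chip, GB: {per_chip_hbm_gb(POD_HBM_PETABYTES, CHIPS):.0f}")
    print(
        "Same-price predecessor, EF: "
        f"{implied_predecessor_exaflops(POD_EXAFLOPS, SPEEDUP_VS_IRONWOOD):.1f}"
    )
```

On the quoted numbers this works out to roughly 12.6 PetaFlops and a little over 200 gigabytes of HBM per chip.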

The TPU 8i is the inference specialist, and it reveals where Google actually thinks the money is. This chip triples on-chip SRAM to 384 megabytes, doubles inter-chip interconnect bandwidth to 19.2 terabits per second, and introduces a new Boardfly topology that reduces network diameter by roughly 56% for mixture-of-experts and reasoning workloads. Performance-per-dollar improves 80% over the previous generation. The math is direct: companies can serve nearly twice the customer volume at the same cost. MediaTek designed the TPU 8i. Google DeepMind was in the room. The configuration scales up to 1,152 chips and delivers 11.6 ExaFlops. Inference does not need as much raw power as training. It needs low latency, high memory bandwidth, and the ability to parallelize token generation across thousands of simultaneous requests without bottlenecking. The TPU 8i was built for exactly that problem.
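The "nearly twice the customer volume" line follows directly from the 80% figure, and the same arithmetic recovers the previous-generation numbers the paragraph implies. A minimal sketch, assuming the quoted gains apply uniformly and that "triples" and "doubles" are exact multiples. The function names are illustrative only.

```python
# Figures quoted in the article for the TPU 8i.
PERF_PER_DOLLAR_GAIN = 0.80      # 80% improvement over the previous generation
SRAM_MB = 384                    # "triples on-chip SRAM to 384 megabytes"
INTERCONNECT_TBPS = 19.2         # "doubles ... to 19.2 terabits per second"
CHIPS = 1_152
CONFIG_EXAFLOPS = 11.6


def volume_at_same_cost(gain: float) -> float:
    """Relative request volume served for the same spend."""
    return 1.0 + gain


def cost_at_same_volume(gain: float) -> float:
    """Relative spend needed to serve the same request volume."""
    return 1.0 / (1.0 + gain)


def previous_generation(value: float, multiple: float) -> float:
    """Back out the earlier figure from a quoted 'N times' improvement."""
    return value / multiple


if __name__ == "__main__":
    print(f"Volume at same cost: {volume_at_same_cost(PERF_PER_DOLLAR_GAIN):.2f}x")
    print(f"Cost at same volume: {cost_at_same_volume(PERF_PER_DOLLAR_GAIN):.2f}x")
    print(f"Implied prior SRAM, MB: {previous_generation(SRAM_MB, 3):.0f}")
    print(f"Implied prior link, Tb/s: {previous_generation(INTERCONNECT_TBPS, 2):.1f}")
    print(f"PFLOPs per chip: {CONFIG_EXAFLOPS * 1_000 / CHIPS:.1f}")
```

Read the other way, an 80% gain in performance-per-dollar means the same traffic costs a bit more than half of what it did, about 44% less.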

Two anchor customers just bet their infrastructure roadmaps on Google's ability to deliver. Anthropic, which announced access to up to one million TPUs in a deal worth tens of billions of dollars, is expanding that commitment through a Google and Broadcom agreement for multiple gigawatts of next-generation TPU capacity beginning in 2027. Meta struck a multibillion-dollar agreement to procure TPUs via Google Cloud. Both deals pay off only if Google's claims about scaling efficiency and cost structure hold in production. If they do not, Anthropic and Meta will have locked themselves into a secondary supplier relationship with worse economics than staying with Nvidia. The risk is real. The bet is bigger.

This bifurcation exposes something fundamental about Nvidia's position. Nvidia ships monolithic GPUs that are general-purpose enough to handle training, inference, and everything in between. That flexibility was an advantage when no one knew what AI infrastructure would look like. It becomes a disadvantage the moment specialized chips exist. The data suggests inference is the higher-value problem to optimize for—it is where nearly all production workloads actually run. Nvidia's H100 and H200 are built as training chips first, inference chips second. Google just said the order is backwards and built silicon to prove it. Amin Vahdat, Google's senior vice president for AI infrastructure, stated it directly: with the rise of AI agents, the community benefits from chips individually specialized to the needs of training and serving. That is not a market segmentation argument. That is a claim that the training-and-inference dichotomy is no longer a temporary condition—it is permanent. Agents need both capabilities, but rarely in the same ratios. A training run happens once. Inference happens billions of times.

Google is also quietly solving a decade-old friction point. Framework lock-in has kept workloads on Nvidia by default. TensorFlow and TPUs go together. Everyone else runs CUDA on GPUs. Google is now offering bare-metal TPU access and native PyTorch support. That removes the primary technical reason a researcher or team would default to Nvidia hardware when cost and performance are equal. If Google can demonstrate that TPU 8t and 8i actually deliver on the scaling claims and cost targets, the default changes. Large buyers like Meta and Anthropic have the scale to absorb switching costs and negotiate directly with Google. Mid-market companies with less leverage will follow if the numbers look credible. Nvidia remains the safer choice for customers who cannot afford to bet on unproven scaling claims. But Google is no longer asking for faith. It is asking for the right to compete on performance and economics. Anthropic and Meta just said yes.

Watch three things. First, general availability for both TPU 8t and 8i in 2026. Any slippage signals problems with the Virgo Network scaling or on-chip memory yields. Second, whether the Broadcom gigawatt-scale TPU agreement with Anthropic actually begins in 2027 at the scale promised. That is where you discover if Google's cost structure is real or if the economics only work at test-bed scale. Third, the million-chip clustering test in live production. Google claims near-linear scaling to a million chips. Nvidia's NVLink fabric tops out around 32,000 chips efficiently. If Google's Virgo Network actually delivers what Google claims, it is a different game. If it does not, Google has a very expensive solution to a problem that does not exist yet. The announcement is confident. The engineering is real. The proof is still months away.
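"Near-linear scaling" has a measurable meaning: throughput on N chips divided by N times the throughput of one chip, with 1.0 being perfect. The sketch below pairs that definition with a deliberately crude toy model in which every step pays a communication cost that grows with the logarithm of the cluster size. The model and both of its parameters are hypothetical illustrations, not anything Google has published about the Virgo Network. It only shows why a small per-step overhead that is invisible at pod scale can matter at a million chips.

```python
import math


def scaling_efficiency(throughput_n: float, throughput_1: float, n_chips: int) -> float:
    """Measured efficiency: 1.0 means N chips do exactly N times the work of one."""
    return throughput_n / (n_chips * throughput_1)


def toy_efficiency(n_chips: int, compute_s: float, comm_s_per_level: float) -> float:
    """Hypothetical model: fixed compute per step plus log2(N) communication levels."""
    if n_chips < 1:
        raise ValueError("n_chips must be at least 1")
    step_s = compute_s + comm_s_per_level * math.log2(n_chips)
    return compute_s / step_s


def effective_chips(n_chips: int, efficiency: float) -> float:
    """How many perfectly scaling chips the cluster is worth."""
    return n_chips * efficiency


if __name__ == "__main__":
    # Both overhead values are made up for illustration.
    for overhead in (0.001, 0.02):
        for n in (9_600, 1_000_000):
            eff = toy_efficiency(n, compute_s=1.0, comm_s_per_level=overhead)
            print(f"overhead={overhead} chips={n:>9,} efficiency={eff:.3f} "
                  f"effective={effective_chips(n, eff):,.0f}")
```

When production numbers arrive, the first function is the one to apply: take the reported throughput at cluster scale, divide by chip count times single-chip throughput, and see how far the result sits below 1.0.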