Jim Keller's new machine generates 720p video at 30 frames per second in 2.4 seconds. The system that did it, a cluster of Tenstorrent Galaxy Blackhole servers networked together at Prodia Labs, is neither a research prototype nor a press-release fantasy. It shipped in May 2026 and is running customer workloads right now. This matters because for nearly a decade, the industry has assumed Nvidia owns AI infrastructure by default. Tenstorrent just made that assumption dangerous.

The inference chip market has bifurcated. Nvidia dominates training with H100s and H200s because it owns the software stack (CUDA, cuDNN, TensorRT) and developers have built billion-dollar workflows around it. But inference is different. Inference is margin-thin, latency-critical, and cost-sensitive. It is the business where a 30 percent improvement in tokens-per-dollar actually moves an operator's unit economics. Tenstorrent entered in April 2025 with a narrower bet: build a system that runs off-the-shelf models fast and cheap, without forcing customers to retrain or recompile. Galaxy Blackhole is that bet, fully built and deployed.

Here is what the hardware actually is. Each Galaxy Blackhole contains 32 Tenstorrent Blackhole chips in a single server, delivering 23 PFLOPS of Block FP8 compute from 6.2 GB of on-chip SRAM with 2.9 PB/s of bandwidth, plus 1 TB of GDDR6 memory with 16 TB/s of bandwidth. Sixteen Galaxy units can be cabled together in an all-to-all Ethernet topology (56 × 800G ports per system, 11.2 TB/s of scale-out bandwidth); four Galaxies form a quad, which is then replicated into superclusters. The base supercluster of four Galaxies starts at $440,000, and a single Galaxy Blackhole server starts at $110,000. That price point is the story's first shock. Nvidia's eight-way DGX boxes, capable of higher absolute throughput, cost between $300,000 and $550,000 depending on configuration. Tenstorrent is not faster on every workload. But it trades peak training performance for cost-efficiency on inference, the workload that actually generates revenue for LLM API providers.
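The price gap is easier to see as cost per unit of compute. A back-of-envelope sketch, with two labeled assumptions: the DGX price is taken as the midpoint of the $300,000-$550,000 range, and the DGX FP8 throughput of 32 PFLOPS is a commonly cited sparse figure for an eight-way H100 system, not a number from this article:

```python
# Back-of-envelope cost per PFLOP of FP8 compute.
# Galaxy figures come from the article; the DGX price is an assumed
# midpoint and its 32 PFLOPS FP8 (sparse) throughput is an assumption.
galaxy_price, galaxy_pflops = 110_000, 23
dgx_price, dgx_pflops = 425_000, 32

galaxy_cost = galaxy_price / galaxy_pflops  # dollars per PFLOP
dgx_cost = dgx_price / dgx_pflops

print(f"Galaxy: ${galaxy_cost:,.0f}/PFLOP, DGX: ${dgx_cost:,.0f}/PFLOP")
# Galaxy: $4,783/PFLOP, DGX: $13,281/PFLOP
```

Even if the assumed DGX figures are off by a third in Nvidia's favor, the per-PFLOP gap stays wide, which is the shape of the inference-economics argument the rest of the piece makes.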

Tenstorrent's own benchmarks show why operators are already buying. On DeepSeek-R1-0528, a 671-billion-parameter model, Galaxy Blackhole in 'Blitz Mode' achieves 350-plus tokens per second per concurrent user and sub-4-second time-to-first-token. Prodia Labs, running video generation workloads, reports 10x faster generation than the previous best GPU-based system. These are not cherry-picked microbenchmarks. Artificial Analysis, the independent inference leaderboard that ranks models and hardware by real-world performance, validates these numbers publicly. More important: 16 Galaxy Blackhole servers are already installed at Equinix's Ashburn data center. Turiyam, an image-as-a-service provider, has ordered up to 32 units for deployment in India. Cirrascale, which builds inference clouds for enterprise customers, is taking on Galaxy as a core infrastructure product. AI&, a new Japanese AI startup, is deploying Galaxies into production. These are not press-release customers. Equinix does not install hardware that does not work. Cirrascale does not bet its entire cloud offering on unproven systems.

The software story is where Tenstorrent's advantage actually compounds. The company's compiler, TT-Forge, achieves a 90 percent pass rate on models pulled directly from Hugging Face, which hosts roughly 2.5 million AI models. It can ingest PyTorch, TensorFlow, CUDA, ONNX, and even text from AI papers and compile them to run on the hardware without manual optimization. The entire stack is open source. This matters more than it sounds. A data center operator who switches from Nvidia to Tenstorrent does not need to rewrite pipelines or hire new compiler engineers. They point their existing model zoo at the Galaxy hardware and it works. That eliminates the single biggest switching cost in AI infrastructure: the retraining tax. Nvidia's ecosystem strength is real, but it assumes customers want to stay locked in. If you give customers a way to leave that costs nothing, they leave.

Who wins and who loses is now clear. Tenstorrent wins with inference operators who prioritize cost per token and cannot afford to be locked into Nvidia's software ecosystem. Equinix wins because it can now offer inference capacity at lower cost than competitors using pure Nvidia. Cirrascale wins because it can offer a differentiated product. Nvidia does not lose the training market; Galaxy Blackhole cannot replace H200s for model development. But Nvidia's inference margin story is now challenged. If Tenstorrent holds its $6-per-million-tokens roadmap target, and Nvidia's inference price floor remains $8-12 per million tokens, operators doing inference at scale will choose Tenstorrent. The architectural insight here is Jim Keller's: disaggregated medium-performance chips connected by commodity Ethernet scale to the same throughput as monolithic high-performance GPUs, at lower cost and with simpler software. That insight is hard to argue against once you see the benchmark data.
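The per-token gap compounds quickly at volume. A rough sketch using the article's $6 roadmap target against the midpoint of the $8-12 floor; the one-billion-tokens-per-day workload is an assumed illustrative figure, not a number from the article:

```python
# Annual inference bill at an assumed 1B tokens/day, comparing
# Tenstorrent's $6/M-token roadmap target with a $10/M midpoint
# of the $8-12 Nvidia floor cited above.
tokens_per_day = 1_000_000_000  # assumed workload, for illustration
tt_rate, nv_rate = 6, 10        # dollars per million tokens

def annual_cost(rate_per_million_tokens):
    """Yearly spend for the assumed daily token volume."""
    return tokens_per_day / 1_000_000 * rate_per_million_tokens * 365

savings = annual_cost(nv_rate) - annual_cost(tt_rate)
print(f"Annual savings at 1B tokens/day: ${savings:,.0f}")
# Annual savings at 1B tokens/day: $1,460,000
```

At a hundred times that volume, the same 40 percent rate gap is nine figures a year, which is why a margin-thin inference operator would switch.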

The real question is manufacturing and scale. Tenstorrent has backing from Bezos Expeditions, Samsung, LG Electronics, Hyundai, and Fidelity. The company raised over $1 billion and hit a $2.6 billion Series D valuation in December 2024. It has operations in Santa Clara, Austin, Toronto, Belgrade, Tokyo, and Bangalore. It has the capital and the talent. But it does not have Nvidia's supply chain relationships with TSMC, and it does not have a decade of software integration into every major AI framework. Volume production is the tell. If Tenstorrent can deliver 200-500 Galaxy superclusters per year over the next 18 months, it wins the inference market. If it hits capacity at 20-30 superclusters and develops a backlog that stretches past 2027, Nvidia stabilizes and defends its inference margin until something else disrupts it. The company's own roadmap targets 500 tokens per second per user at $6 per million tokens, a 50 percent cost reduction from the current Galaxy while maintaining or exceeding performance. If they hit that, Nvidia's inference business becomes a rounding error.
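The stated production band implies a concrete revenue range at the base configuration. A sketch using only the article's figures; actual revenue would be higher since many deployments exceed the four-Galaxy base:

```python
# Revenue implied by the 200-500 superclusters/year production band,
# priced at the $440k base four-Galaxy supercluster configuration.
base_price = 440_000
low, high = 200, 500  # superclusters per year

low_rev = low * base_price
high_rev = high * base_price
print(f"Implied annual revenue: ${low_rev/1e6:.0f}M - ${high_rev/1e6:.0f}M")
# Implied annual revenue: $88M - $220M
```

Against a Series D valuation of $2.6 billion, the high end of that band is the level at which volume production starts to justify the company's capital base.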

Watch three things. First, Equinix's next earnings call in July 2026: management will disclose whether the 16-unit Galaxy deployment at Ashburn is expanding or stalling. If it is expanding beyond 50 units, Equinix is betting the company on Tenstorrent as an alternative to Nvidia. Second, Cirrascale's customer adoption rates over the next two quarters: if Cirrascale is fielding Galaxy systems to 10-plus enterprise customers by Q3 2026 and those deployments result in renewal rates above 85 percent, the inference shift is real. Third, Turiyam's India deployment timeline. India has been desperate for inference capacity at lower cost than US public clouds. If Turiyam actually deploys 32 units and achieves 80 percent utilization by October 2026, Tenstorrent has proof that it can execute outside Silicon Valley's ecosystem.