Tenstorrent's $110K AI Server Challenges Nvidia's Inference Dominance

HyperSinc Intelligence/MAY 3, 2026

Tenstorrent launches Galaxy Blackhole inference servers at one-third to one-fifth the per-node price of Nvidia DGX, with 16 units already deployed at Equinix and performance claims that undercut the GPU+LPU disaggregation trend.

Jim Keller built Apple's A4 processor, architected AMD Zen, and designed Tesla's autonomous driving chip. On May 1, 2026, he deployed his latest creation into production: a $110,000 box that Tenstorrent claims does large-language-model inference faster and cheaper than the systems every cloud operator and AI company has been buying from Nvidia.

That box is the Galaxy Blackhole. It is not a graphics processor or a specialized inference accelerator. It is a 32-chip integrated inference server running fully open-source software on RISC-V architecture, housed in a single air-cooled 4U chassis, with 6.2 gigabytes of on-chip SRAM and up to 56 network ports. Tenstorrent announced general availability on April 28 and held a public launch event two days later. Sixteen units are already deployed at Equinix's data center in Ashburn, Virginia. Turiyam, an image-as-a-service startup, is deploying 32 of them in India. Cirrascale, which operates AI clouds for enterprises, has ordered them. Prodia, a video-generation company, benchmarked one and says it generates an 81-frame 720p video in 2.4 seconds. Tenstorrent is backed by Bezos Expeditions, Samsung, LG Electronics, Hyundai Motor Group, and Fidelity, and has raised over $1 billion.

The Galaxy Blackhole's specifications read like a direct response to the inference market's current architecture. A single node delivers 23 petaflops of Block FP8 compute and costs $110,000; a supercluster of four systems starts at $440,000. For context, Nvidia's eight-way DGX boxes cost between three and five times as much per node. Tenstorrent claims its system exceeds 350 tokens per second per user on DeepSeek-R1-0528, a 671-billion-parameter model, with sub-4-second time-to-first-token latency in what the company calls Blitz Mode. EE Times, testing the system before launch, recorded 255 tokens per second per user on shorter chatbot prompts. That is about 27 percent below Tenstorrent's headline claim, though still respectable for an inference box at that price point.
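The arithmetic behind those figures is easy to check. The short Python sketch below uses only numbers quoted in this article; the variable names are illustrative, and the DGX range is what the three-to-five-times multiplier implies, not a price quoted by Nvidia.

```python
# Figures quoted in the article. Names are illustrative, not from any vendor tool.
NODE_PRICE_USD = 110_000   # Galaxy Blackhole, per node
CLAIMED_TPS = 350          # Tenstorrent's "350+" headline, taken at its floor
MEASURED_TPS = 255         # EE Times pre-launch measurement


def shortfall_pct(claimed: float, measured: float) -> float:
    """Percentage by which the measured rate falls below the claimed rate."""
    return (claimed - measured) / claimed * 100


# Four-node supercluster entry price.
cluster_price = 4 * NODE_PRICE_USD

# DGX per-node range implied by the article's 3x-to-5x multiplier (an inference,
# not a quoted Nvidia list price).
dgx_low, dgx_high = 3 * NODE_PRICE_USD, 5 * NODE_PRICE_USD

gap = shortfall_pct(CLAIMED_TPS, MEASURED_TPS)

print(f"Supercluster entry price: ${cluster_price:,}")
print(f"Implied DGX per-node range: ${dgx_low:,} to ${dgx_high:,}")
print(f"Measured shortfall vs. claim: {gap:.1f}%")
```

Because the headline is "350+", the 27 percent figure is a lower bound on the gap: any claimed rate above 350 would make the shortfall larger.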

This matters because the inference market is fracturing along architectural lines. Nvidia spent $20 billion acquiring Groq, a startup that makes inference accelerators optimized for token generation. The winning pattern, according to the industry, is disaggregation: pair GPUs, which excel at matrix multiplication during prompt processing (prefill), with specialized inference chips like Groq's LPUs, which excel at one-token-at-a-time generation (decode). It is fast. It scales. It locks customers into Nvidia's ecosystem and Groq's silicon. Tenstorrent is betting the opposite: that a single integrated system of its own chips can handle both prefill and decode well enough that customers will prefer the simpler software story and open-source stack. Keller said in an EE Times interview: 'We started making the Galaxy Blackhole servers in January, and somewhere along the way, we started to realize just how fast Tenstorrent AI is. We do something no one else does, where we can hook up a lot of medium-performance chips together in a Galaxy box, then hook those together, and scale applications across multiple Galaxies.' The company claims 90 percent of models from HuggingFace run on Tenstorrent hardware without modification.

The economics are what matter here. Inference is becoming the profit center of AI deployment, not training. Cloud operators are buying hardware by the thousands. Margins matter. If Tenstorrent's Galaxy can genuinely deliver similar throughput and latency at one-fifth to one-third the cost, and if the open-source software stack actually reduces operational friction and vendor lock-in, then Nvidia's inference margins face real pressure. Nvidia is not defenseless: its installed base, CUDA ecosystem, and engineering depth are formidable. But the company's inference play now depends on the added value of Groq's decode performance, not just GPU throughput. If Tenstorrent's integrated approach proves good enough for 80 percent of use cases at a quarter of the price, that is a different market. Groq, meanwhile, becomes a premium option, not a necessity. Dave Driggers, CEO of Cirrascale, said: 'We evaluate a lot of hardware. Most of it is incremental. Tenstorrent Galaxy Blackhole is not. Tenstorrent has taken a clean-sheet approach to AI infrastructure, and the results speak for themselves.'

Tenstorrent is credible. Keller's track record is real, and the company has real funding, real customers, and real deployments. But the gap between its 350+ t/s claim and EE Times' 255 t/s measurement is meaningful. The independent test used shorter prompts, which may explain some of the variance because prefill behaves differently from decode, but a 27 percent shortfall is not margin of error. That gap becomes the first test of whether Tenstorrent is genuinely faster or just cheaper with acceptable performance. If the gap widens as the company scales or as customers stress the systems with production workloads, the narrative collapses. If the gap narrows or is explained by methodology, Tenstorrent becomes the default option for anyone not locked into CUDA. The upshot is that Tenstorrent has built something genuinely useful and priced it to win, but the company is not yet proven at scale. The Equinix deployment and Prodia benchmarks are encouraging signals. The independent testing reveals honest engineering but not yet dominance. Watch whether additional major cloud operators announce deployments in Q2 and Q3 2026, and whether frontier model support expands beyond what has been publicly announced.
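Methodology matters because "tokens per second per user" can be computed in more than one way. The Python sketch below is a minimal illustration, not the procedure EE Times or Tenstorrent used: the function name and the sample timestamps are hypothetical, and it assumes the tester has recorded when the request was sent and when each output token arrived.

```python
from typing import Optional, Sequence, Tuple


def stream_metrics(
    request_sent: float, token_times: Sequence[float]
) -> Tuple[float, Optional[float]]:
    """Return (time_to_first_token_s, decode_tokens_per_s) for one user's stream.

    request_sent and token_times are timestamps in seconds on the same clock.
    The decode rate counts only tokens after the first, so prefill time
    (the wait for the first token) is excluded from it.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    decode_span = token_times[-1] - token_times[0]
    if len(token_times) < 2 or decode_span <= 0:
        return ttft, None  # not enough data to estimate a decode rate
    return ttft, (len(token_times) - 1) / decode_span


# Hypothetical stream: first token after 2.0 s, then one token every 0.5 s.
sample_times = [2.0, 2.5, 3.0, 3.5, 4.0]
ttft, decode_tps = stream_metrics(0.0, sample_times)

# End-to-end rate over the same stream, with prefill time in the denominator.
end_to_end_tps = len(sample_times) / (sample_times[-1] - 0.0)
```

In this toy stream the decode-only rate is 2.0 tokens per second while the end-to-end rate is 1.25, so two testers can report different numbers for the same hardware depending on which definition they use and how long the response runs.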

Three concrete things will tell the story. First, independent benchmarking of Blitz Mode performance against Nvidia GB300 systems paired with Groq LPUs on identical frontier models: the company's own claims need third-party validation at scale. Second, customer announcements from Tier 1 cloud operators: if AWS, Google Cloud, or Azure add Galaxy to their inference menus, Tenstorrent has won mindshare. Third, software compatibility breadth: the 90 percent HuggingFace claim is useful, but deployment depends on supporting the actual models customers want to run, such as Moonshot AI's Kimi K2, Anthropic's Claude, OpenAI's latest, and custom fine-tuned variants. Watch for those explicit commitments.

Key Takeaways

Jim Keller's Tenstorrent is shipping integrated RISC-V inference boxes at $110K per node, not disaggregated GPU+LPU hybrids, and early customer deployments are already live
Tenstorrent claims Galaxy Blackhole hits 350+ tokens/second/user on a 671B-parameter model, but independent testing by EE Times found 255 t/s, a 27% gap worth tracking as the company scales
The architecture bets against Nvidia's Groq acquisition strategy: instead of pairing GPUs for prefill with specialized inference chips for decode, Tenstorrent does both from the same silicon

What it means

Tenstorrent's price-to-performance ratio and open-source software stack create a credible alternative to GPU-centric inference, putting pressure on Nvidia's inference margins and giving customers a non-CUDA path to scale.

DISCLAIMER

This article is for informational purposes only and does not constitute financial, investment, legal, or tax advice.

SOURCES

EE Times, Tenstorrent Galaxy Blackhole AI Inference Server Launch
