Inside the AI Factory - Part 1: How Modern Data Centers Power the Age of Inference
AI inference has turned data centers into industrial-scale factories, built on dense accelerators, high-bandwidth memory, liquid cooling and optical fabrics. The race to serve tokens efficiently is reshaping supply chains from GPUs to fiber switches.
The modern AI inference data center is less a warehouse of servers than a factory floor for tokens, engineered to turn prompts into answers at industrial scale. The sheer size of the build-out tells the story: OpenAI and Nvidia map a path to multibillion-dollar “AI factories” measured in gigawatts, a signal that inference—the always-on act of serving models to users—is becoming critical infrastructure in its own right.
At the heart of these factories sits the accelerator complex. Nvidia’s Blackwell generation reframes the unit of compute from a single card to a rack-scale appliance: the GB200 NVL72 stitches seventy-two GPUs and thirty-six Grace CPUs into one liquid-cooled domain, presenting itself to software as a giant, low-latency pool of memory and math for real-time, trillion-parameter inference. Nvidia promises sharp step-ups in performance with the new FP4 format in the Transformer Engine. The larger point is architectural—move more of the model’s work onto tightly coupled silicon and shorten every wire in sight. Power and cooling define the box: engineering notes peg the rack at around 120 kilowatts, cooled by direct liquid loops that would have looked exotic in a cloud data center not long ago.
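A quick bit of arithmetic, using only the figures in this paragraph plus an assumed budget for a conventional rack, shows how far these machines have drifted from ordinary practice.

```python
# Rack-level power arithmetic, a sketch built on the ~120 kW figure above.
rack_kw = 120
gpus_per_rack = 72
watts_per_gpu_slot = rack_kw * 1000 / gpus_per_rack  # includes CPUs, fabric, pumps
print(f"{watts_per_gpu_slot:.0f} W of rack power per GPU slot")  # ~1,667 W

# For comparison, a typical air-cooled enterprise rack is often budgeted at
# 10-15 kW in total (an assumed, ballpark range) - less than one-eighth of one NVL72.
```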
The competitive lane is real, if narrower. AMD’s Instinct MI350 series has leaned into memory capacity—288 gigabytes of HBM3E per device—to feed long sequences and memory-hungry retrieval. Intel’s Gaudi 3, built around Ethernet fabrics rather than proprietary links, sells a value story on inference efficiency. Both are now shipping through mainstream server makers, broadening the bill of materials beyond Nvidia even as Blackwell sets the performance narrative.
All of this pivots on memory. Inference performance hinges not just on flops but on bringing model weights and key-value caches close to compute, which is why high-bandwidth memory is the scarcest ingredient in the kitchen. SK hynix remains the lead supplier as the industry completes the HBM3E cycle and lines up HBM4; Micron pushed HBM3E into volume early for Nvidia’s H200 and has booked forward output, while Samsung has lately cleared a high-profile Nvidia qualification hurdle. Capacity is catching up after a year of sellouts, but the supply chain is still the gating factor for how fast inference clusters can grow.
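A rough sizing exercise shows why capacity gates inference. The sketch below uses illustrative dimensions for a 70B-class decoder with grouped-query attention; every number is an assumption for the sake of the arithmetic, not a vendor specification.

```python
# Back-of-the-envelope KV-cache sizing for decoder-style inference.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    # Each token stores one key and one value vector per layer per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Assumed 70B-class model: 80 layers, 8 KV heads, head dimension 128, FP16 cache,
# serving 8 concurrent requests at a 32k-token context.
cache = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                       seq_len=32_768, batch_size=8)
print(f"KV cache: {cache / 1e9:.1f} GB")  # ~86 GB before a single weight is loaded
```

Push the batch or the context much further and the cache alone outgrows an accelerator’s HBM, which is exactly why capacity-heavy parts and pooling schemes get so much attention.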
The nervous system is the network, and here a decisive shift is under way. Back-end fabrics inside inference clusters are standardizing on 800-gigabit Ethernet today with a quick march to 1.6 terabits per port, riding 51.2-terabit switch ASICs from Broadcom, Cisco and others. Arista’s current 800G leaf-and-spine boxes, Cisco’s Silicon One G200, and Broadcom’s Tomahawk 5—available in both pluggable-optics and co-packaged variants—anchor the merchant silicon landscape. The thesis is simple: make Ethernet behave like a low-tail-latency, loss-managed fabric for AI by tightening the stack, then amortize that at hyperscale. That is the promise behind the Ultra Ethernet Consortium and Nvidia’s own Spectrum-X line, which pushes Ethernet features that mimic InfiniBand’s determinism. If the industry delivers, inference operators get the scale economics and supply flexibility of Ethernet without giving up training-class reliability.
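To make those bandwidth numbers concrete, here is a radix calculation that uses only the 51.2-terabit figure above; the leaf-spine split and 1:1 oversubscription are assumptions for illustration.

```python
# Carving a 51.2 Tb/s switch ASIC into ports, and sizing a leaf from it.
asic_gbps = 51_200
ports_800g = asic_gbps // 800     # 64 ports at 800G
ports_1600g = asic_gbps // 1600   # 32 ports at 1.6T: double the speed, half the radix

# Two-tier leaf-spine, 1:1 oversubscription: half the leaf ports face nodes,
# half face spines, so one 64-port 800G leaf serves 32 accelerator-node uplinks.
node_ports = ports_800g // 2
print(ports_800g, ports_1600g, node_ports)  # 64 32 32
```

The same arithmetic explains the hurry toward bigger ASICs: moving to 1.6T ports without more switch bandwidth halves the radix, which means more tiers, more optics and more tail latency.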
Optics, once an afterthought, now sits at center stage because every additional meter of copper punishes power and reach. Two trends define the next wave. First is the rise of linear-drive pluggable optics (LPO), which drop the power-hungry digital signal processors (DSPs) from short-reach links and cut watts per bit across 800G and emerging 1.6T modules. The second is co-packaged optics (CPO), which moves the optical engines onto the switch package to collapse the electrical flight path entirely. Demonstrations at OFC and vendor roadmaps suggest both will coexist: LPO to trim the fat at the top-of-rack and leaf layers, CPO to break the next bandwidth wall in switch cores. The roster of optics vendors—Lumentum, Coherent, Eoptolink and others—reads like a who’s who of 200-gigabit-per-lane evangelists.
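The watts-per-bit framing is easy to quantify. The module power figures below are assumed round numbers for illustration only, not datasheet values, but the shape of the comparison is the point.

```python
# Energy per bit for an optical module: power divided by line rate.
def pj_per_bit(module_watts, gbps):
    return module_watts / (gbps * 1e9) * 1e12  # picojoules per bit

print(f"DSP-based 800G module: {pj_per_bit(14, 800):.1f} pJ/bit")  # ~17.5 (assumed 14 W)
print(f"LPO 800G module:       {pj_per_bit(8, 800):.1f} pJ/bit")   # ~10.0 (assumed 8 W)

# Across tens of thousands of links, a few watts saved per module shifts the
# fabric's optics budget by hundreds of kilowatts.
```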
A deceptively powerful building block is the optical circuit switch. Products such as Huber+Suhner’s Polatis platforms add a reconfigurable, software-driven fiber cross-connect to the data center, letting operators reroute light without converting it back to the electrical domain. In practice, that supports fast lab automation, lower-loss re-patching in AI pods, and the kind of dynamic topologies that training clusters and large inference farms increasingly need. As port counts climb into the hundreds per chassis, the appeal is less about headline bandwidth than about utilization and opex.
Compute adjacency matters too. The rise of data processing units (DPUs) and SuperNICs is really a fight over who handles the drudgery that steals cycles from tokens. Nvidia’s BlueField line, its new ConnectX-8 SuperNIC, AMD’s Pensando silicon and Intel’s IPUs all aim to terminate storage and networking, police congestion and security, and deliver predictable latency to the accelerators. Offloading those chores is not glamorous, but it is how clusters keep GPUs busy and service-level objectives intact.
Storage has quietly modernized around inference realities. Hot weights and embeddings favor fast object stores and NVMe-over-Fabrics, with file systems recast as streaming engines for models rather than long-lived data lakes. Vendors such as VAST have chased explicit certifications with Nvidia’s partner ecosystem and built features like model streaming to shrink cold-start times when tenants swap models in and out. Legacy incumbents have followed with AI-tuned data platforms to keep up. The economics are straightforward: faster model loads and higher GPU utilization reduce cost per million tokens in production.
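One way to see the utilization lever is a toy cost model; the price and throughput inputs below are assumptions, not benchmarks.

```python
# Cost per million served tokens as a function of GPU price, throughput and utilization.
def cost_per_million_tokens(gpu_hourly_usd, tokens_per_sec, utilization):
    tokens_per_hour = tokens_per_sec * 3600 * utilization  # effective output scales with utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Same hardware and price; the only difference is how busy the GPU stays
# (e.g. slow cold starts and model swaps versus streaming loads).
print(f"${cost_per_million_tokens(4.0, 2500, 0.45):.2f} per million tokens")  # ~$0.99
print(f"${cost_per_million_tokens(4.0, 2500, 0.85):.2f} per million tokens")  # ~$0.52
```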
A second lever on those economics is memory disaggregation. The CXL 3.x standard now enables pooled, fabric-attached memory with the coherency and multi-level switching needed to be useful beyond one chassis, and a cottage industry of controllers and smart cables has sprung up to make it real. For inference, the prize is cutting stranded DRAM costs and stretching context windows without buying more accelerators. Early proofs from vendors such as Astera Labs point to meaningful throughput gains in recommenders and chatbots when caches can spill into fast external memory.
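The capacity arithmetic behind that claim is simple, even if the bandwidth and latency of the pool ultimately decide whether it is usable. The figures below (free HBM after weights, pool size, per-token cache footprint) are assumptions for illustration.

```python
# How a fabric-attached pool stretches the servable context, capacity-wise.
kv_bytes_per_token = 320 * 1024   # assumed ~320 KB/token for a 70B-class FP16 cache
hbm_free_gb = 40                  # assumed HBM left over after weights
cxl_pool_gb = 512                 # assumed fabric-attached memory visible to the node

tokens_local = hbm_free_gb * 1e9 / kv_bytes_per_token
tokens_pooled = (hbm_free_gb + cxl_pool_gb) * 1e9 / kv_bytes_per_token
print(f"{tokens_local:,.0f} cached tokens locally vs {tokens_pooled:,.0f} with the pool")
```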
None of this runs on thin air. Power delivery has standardized on 48-volt Open Rack V3 busbars, with thousand-amp connectors feeding dense shelves and in-row coolant distribution units doing the thermal heavy lifting. Liquid’s energy math is finally compelling at the rack: studies from cooling vendors working with Nvidia show material reductions in total facility power versus fully air-cooled rooms once densities pass the Blackwell class. The frontier is getting even more intimate as Microsoft and partners experiment with microfluidic channels etched into the silicon itself, a lab-to-fab idea that, if it scales, would let data centers pack more compute into the same footprint.
The software stack is evolving just as fast. Production inference now assumes a serving layer that understands quantization, batching and cache topology as business logic. Nvidia’s Triton has become the default in many enterprises, while open-source vLLM popularized paged attention to keep key-value caches from blowing out GPU memory. TensorRT-LLM leans into Blackwell’s FP4 path, mixing precisions layer-by-layer to reclaim accuracy while keeping throughput high. The flip side of standardization is exposure: recent disclosures around Triton vulnerabilities were a reminder that inference servers are just that—servers—and need the same patch hygiene and isolation as any web front end.
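For a flavor of that serving layer, here is a minimal offline-generation sketch against vLLM’s Python API; the model name is illustrative, and quantization, parallelism and server deployment are all left out.

```python
# Minimal vLLM usage: PagedAttention manages the KV cache and requests are
# batched continuously behind this simple interface.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")       # illustrative model choice
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain KV caching in one sentence."], params)
print(outputs[0].outputs[0].text)
```

In production the same engine typically sits behind an OpenAI-compatible HTTP server, which is where the patch-hygiene point above starts to matter.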
Suppliers span the familiar and the newly crucial. Nvidia still defines the pace on accelerators, interconnects and software; AMD and Intel are credible alternatives in specific inference footprints. Broadcom, Cisco and Marvell power the switching silicon inside Arista, HPE and others. Lumentum, Coherent and Eoptolink ship the optics; Huber+Suhner’s Polatis brings optical switching; Vertiv and peers plumb the coolant and power shelves; storage players like VAST compete to make model streaming and retrieval feel local. There is plenty of room for integration specialists—Supermicro, HPE and the ODMs—who can package these parts into validated “AI PODs” that enterprises can actually buy.
Two forces will shape the next twelve to twenty-four months. One is efficiency: FP4 inference, smarter schedulers, and memory pooling are squeezing more served tokens from each watt and each dollar. The other is optics: the shift to Ethernet fabrics and co-packaged lasers is rewriting power budgets and failure modes from the top of rack to the spine. The winners in this cycle will be the builders who master both the physics and the software, who can head off the hot spots—thermal, network tail latency, security—and keep the factory humming. Inference, finally, is an operations business.
Author

Investment manager, forged by many market cycles. Learned a lasting lesson: real wealth comes from owning businesses with enduring competitive advantages. At Qmoat.com I share my ideas.