Inside the world’s most powerful AI datacenter


This week we introduced a wave of purpose-built datacenters and infrastructure investments we are making around the world to support the global adoption of cutting-edge AI workloads and cloud services.

Today in Wisconsin we introduced Fairwater, our newest US AI datacenter, the largest and most sophisticated AI factory we've built yet. In addition to our Fairwater datacenter in Wisconsin, we also have multiple identical Fairwater datacenters under construction in other locations across the US.

In Narvik, Norway, Microsoft announced plans with nScale and Aker JV to develop a new hyperscale AI datacenter.

In Loughton, UK, we announced a partnership with nScale to build the UK's largest supercomputer to support services in the UK.

These AI datacenters are significant capital projects, representing tens of billions of dollars of investment and hundreds of thousands of cutting-edge AI chips, and will seamlessly connect with our global Microsoft Cloud of more than 400 datacenters in 70 regions around the world. Through innovation that enables us to link these AI datacenters in a distributed network, we multiply their efficiency and compute exponentially, further democratizing access to AI services globally.

So what is an AI datacenter?

The AI datacenter: the new factory of the AI era

Aerial view of Microsoft's new AI datacenter campus in Mt. Pleasant, Wisconsin.

An AI datacenter is a unique, purpose-built facility designed specifically for AI training as well as running large-scale artificial intelligence models and applications. Microsoft's AI datacenters power OpenAI, Microsoft AI, our Copilot capabilities and many more leading AI workloads.

The new Fairwater AI datacenter in Wisconsin stands as a remarkable feat of engineering, covering 315 acres and housing three massive buildings with a combined 1.2 million square feet under roof. Constructing this facility required 46.6 miles of deep foundation piles, 26.5 million pounds of structural steel, 120 miles of medium-voltage underground cable and 72.6 miles of mechanical piping.

Unlike typical cloud datacenters, which are optimized to run many smaller, independent workloads such as hosting websites, email or business applications, this datacenter is built to operate as one massive AI supercomputer, using a single flat network to interconnect hundreds of thousands of the latest NVIDIA GPUs. In fact, it will deliver 10X the performance of the world's fastest supercomputer today, enabling AI training and inference workloads at a level never before seen.

The role of our AI datacenters – powering frontier AI

Effective AI models rely on thousands of computers working together, powered by GPUs, or specialized AI accelerators, to process massive concurrent mathematical computations. They're interconnected with extremely fast networks so they can share results instantly, and all of this is supported by enormous storage systems that hold the data (like text, images or video) broken down into tokens, the small units of information the AI learns from. The goal is to keep these chips busy all the time, because if the data or the network can't keep up, everything slows down.

The AI training itself is a cycle: the AI processes tokens in sequence, makes predictions about the next one, checks them against the right answers and adjusts itself. This repeats trillions of times until the system gets better at whatever it's being trained to do. Think of it like a professional football team's practice. Each GPU is a player running a drill, the tokens are the plays being executed step by step, and the network is the coaching staff, calling out instructions and keeping everyone in sync. The team repeats plays over and over, correcting mistakes until they can execute them perfectly. By the end, the AI model, like the team, has mastered its strategy and is ready to perform under real game conditions.
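
To make that cycle concrete, here is a minimal PyTorch sketch of the loop described above: process a batch of tokens, predict the next one, compare against the right answer and adjust. The tiny model and random token data are stand-ins for illustration only; this is not Microsoft's or OpenAI's actual training code.

```python
# Minimal sketch of the next-token-prediction training cycle described above.
# The small model and random "token" data are illustrative stand-ins.
import torch
import torch.nn as nn

vocab_size, seq_len, batch_size = 1000, 64, 8
model = nn.Sequential(
    nn.Embedding(vocab_size, 128),
    nn.Linear(128, vocab_size),
)  # stand-in for a far larger transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):  # real training repeats this trillions of times
    tokens = torch.randint(0, vocab_size, (batch_size, seq_len + 1))
    inputs, targets = tokens[:, :-1], tokens[:, 1:]   # the "right answer" is the next token
    logits = model(inputs)                            # predictions: (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()                                   # measure the mistakes
    optimizer.step()                                  # adjust the model
```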

AI infrastructure at frontier scale

Purpose-built infrastructure is critical to powering AI efficiently. To compute the token math at the trillion-parameter scale of leading AI models, the core of the AI datacenter is made up of dedicated AI accelerators (such as GPUs) mounted on server boards alongside CPUs, memory and storage. A single server hosts multiple GPU accelerators, connected for high-bandwidth communication. These servers are then installed into a rack, with top-of-rack (ToR) switches providing low-latency networking between them. Every rack in the datacenter is interconnected, creating a tightly coupled cluster. From the outside, this architecture looks like many independent servers, but at scale it functions as a single supercomputer where hundreds of thousands of accelerators can train a single model in parallel.
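
As a rough illustration of how "many servers, one supercomputer" shows up in software, the sketch below uses PyTorch's DistributedDataParallel: each process drives one GPU and a collective backend keeps every copy of the model in lockstep. This is a generic, hedged example, not Microsoft's internal training stack.

```python
# Sketch: many servers driven as one training job. Each process owns one GPU;
# the NCCL collective backend synchronizes gradients so all copies stay identical.
# Launch with: torchrun --nnodes=<servers> --nproc_per_node=<gpus per server> train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # rendezvous across all participating servers
    local_rank = int(os.environ["LOCAL_RANK"])     # which GPU this process owns on its server
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(4096, 4096).cuda(local_rank)   # stand-in model replica
    model = DDP(model, device_ids=[local_rank])    # all-reduces gradients after backward

    data = torch.randn(32, 4096, device=local_rank)
    loss = model(data).sum()
    loss.backward()                                # gradients averaged across every GPU in the job
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```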

This datacenter runs a single, massive cluster of interconnected NVIDIA GB200 servers with millions of compute cores and exabytes of storage, all engineered for the most demanding AI workloads. Azure was the first cloud provider to bring online NVIDIA GB200 servers, racks and full datacenter clusters. Each rack packs 72 NVIDIA Blackwell GPUs, tied together in a single NVLink domain that delivers 1.8 terabytes per second of GPU-to-GPU bandwidth and gives each GPU access to 14 terabytes of pooled memory. Rather than behaving like dozens of separate chips, the rack operates as a single, giant accelerator, capable of processing an astonishing 865,000 tokens per second, the highest throughput of any cloud platform available today. The Norway and UK AI datacenters will use similar clusters and take advantage of NVIDIA's next AI chip design (GB300), which offers even more pooled memory per rack.
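
A quick back-of-envelope pass puts the rack figures quoted above in perspective; the per-GPU numbers below are simply our own division of the published totals, not separately published specifications.

```python
# Back-of-envelope arithmetic from the GB200 rack figures quoted above.
gpus_per_rack = 72
rack_tokens_per_sec = 865_000       # cited rack-level throughput
pooled_memory_tb = 14               # pooled memory visible to each GPU over NVLink
nvlink_tb_per_sec = 1.8             # GPU-to-GPU bandwidth inside the NVLink domain
scaleout_tb_per_sec = 800 / 8 / 1000  # an 800 Gbps scale-out link, expressed in TB/s

print(f"~{rack_tokens_per_sec / gpus_per_rack:,.0f} tokens/s per GPU")        # about 12,000
print(f"{pooled_memory_tb} TB of pooled memory shared across the rack")
print(f"NVLink is ~{nvlink_tb_per_sec / scaleout_tb_per_sec:.0f}x a single "
      f"800 Gbps scale-out link")                                             # about 18x
```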

The challenge in establishing supercomputing scale, particularly as AI training requirements continue to demand breakthrough scales of computing, is getting the networking topology just right. To ensure low-latency communication across multiple layers in a cloud environment, Microsoft needed to extend performance beyond a single rack. For the latest NVIDIA GB200 and GB300 deployments globally, at the rack level these GPUs communicate over NVLink and NVSwitch at terabytes per second, collapsing memory and bandwidth barriers. To connect multiple racks into a pod, Azure uses both InfiniBand and Ethernet fabrics that deliver 800 Gbps in a full fat-tree, non-blocking architecture, ensuring that every GPU can talk to every other GPU at full line rate without congestion. And across the datacenter, multiple pods of racks are interconnected to reduce hop counts and enable tens of thousands of GPUs to function as one global-scale supercomputer.
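
To see why full line rate without congestion matters, here is a rough cost-model sketch for exchanging gradients over the scale-out fabric, using the standard ring all-reduce estimate. Every input below is an assumption chosen for illustration, not a measured Azure figure.

```python
# Rough estimate of how long one full gradient exchange takes over the scale-out
# fabric, using the common ring all-reduce cost model. All inputs are assumptions.
def allreduce_seconds(params_billions, nic_gbps, num_gpus, bytes_per_param=2):
    size_bytes = params_billions * 1e9 * bytes_per_param
    link_bytes_per_sec = nic_gbps / 8 * 1e9
    # A ring all-reduce moves roughly 2*(N-1)/N of the payload over each GPU's link.
    return 2 * (num_gpus - 1) / num_gpus * size_bytes / link_bytes_per_sec

# Hypothetical example: 70B parameters in fp16, 800 Gbps per GPU, 1,024 GPUs.
print(f"~{allreduce_seconds(70, 800, 1024):.2f} s per full gradient exchange")
```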

When laid out in a traditional datacenter hallway, physical distance between racks introduces latency into the system. To address this, the racks in the Wisconsin AI datacenter are laid out in a two-story datacenter configuration, so in addition to being networked to neighboring racks, they are networked to additional racks above or below them.

This layered approach sets Azure apart. Microsoft Azure was not just the first cloud to bring GB200 online at rack and datacenter scale; we're doing it at massive scale with customers today. By co-engineering the full stack with the best from our industry partners, coupled with our own purpose-built systems, Microsoft has built the most powerful, tightly coupled AI supercomputer in the world, purpose-built for frontier models.

A high-density cluster of AI infrastructure servers in a Microsoft datacenter.

Addressing the environmental impact: closed-loop liquid cooling at facility scale

Traditional air cooling can't handle the density of modern AI hardware. Our datacenters use advanced liquid cooling systems: integrated pipes circulate cold liquid directly into servers, extracting heat efficiently. The closed-loop recirculation ensures zero water waste; water is only needed to fill the system once and is then continually reused.

By designing purpose-built AI datacenters, we were able to build liquid cooling infrastructure directly into the facility, giving us more rack density in the datacenter. Fairwater is supported by the second-largest water-cooled chiller plant on the planet and will continuously circulate water in its closed-loop cooling system. The hot water is piped out to the cooling "fins" on each side of the datacenter, where 172 20-foot fans cool and recirculate the water back to the datacenter. This system keeps the AI datacenter running efficiently, even at peak loads.
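
For a sense of the physics behind facility-scale liquid cooling, the sketch below applies the basic sensible-heat relation (power = flow x specific heat x temperature rise) to a hypothetical rack. The rack power and temperature rise are assumptions for illustration, not Fairwater's actual design parameters.

```python
# Rough sizing sketch: how much water flow carries away one rack's heat.
# The 120 kW rack power and 10 C temperature rise are generic assumptions.
def water_flow_litres_per_sec(rack_kw, delta_t_c):
    specific_heat = 4.186   # kJ/(kg*K) for water
    density = 1.0           # kg per litre, approximately
    return rack_kw / (specific_heat * delta_t_c * density)

print(f"~{water_flow_litres_per_sec(120, 10):.1f} litres/s of water flow per rack")
```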

Aerial view of part of the closed-loop liquid cooling system.

Over 90% of our datacenter capacity uses this system, requiring water only once during operation and continually reusing it with no evaporation losses. The remaining 10% of traditional servers use outside air for cooling, switching to water only during the hottest days, a design that dramatically reduces water use compared to traditional datacenters.

We're also using liquid cooling to support AI workloads in many of our existing datacenters; this liquid cooling is accomplished with Heat Exchanger Units (HXUs) that also operate with zero operational water use.

Storage and compute: Built for AI speed

Modern datacenters can contain exabytes of storage and millions of CPU compute cores. To support the AI infrastructure cluster, an entirely separate datacenter infrastructure is needed to store and process the data used and generated by the AI cluster. To give you an example of the scale: the Wisconsin AI datacenter's storage systems are five football fields in length!

Aerial view of a dedicated storage and compute datacenter used to store and process data for the AI datacenter.

We reengineered Azure storage for the most demanding AI workloads across these massive datacenter deployments, for true supercomputing scale. Each Azure Blob Storage account can sustain over 2 million read/write transactions per second, and with millions of accounts available, we can elastically scale to meet virtually any data requirement.

Behind this capability is a fundamentally rearchitected storage foundation that aggregates capacity and bandwidth across thousands of storage nodes and hundreds of thousands of drives. This enables exabyte-scale storage, eliminating the need for manual sharding and simplifying operations for even the largest AI and analytics workloads.

Key innovations such as BlobFuse2 deliver high-throughput, low-latency access for GPU node-local training, ensuring that compute resources are never idle and that massive AI training datasets are always available when needed. Multiprotocol support allows seamless integration with diverse data pipelines, while deep integration with analytics engines and AI tools accelerates data preparation and deployment.
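
As an illustration of the "keep the GPUs fed" point, the sketch below streams token shards from a filesystem path where a BlobFuse2 mount is assumed to already exist (the /mnt/training mount point and .npy shard layout are hypothetical), using DataLoader workers to prefetch the next shards while the GPUs train on the current batch.

```python
# Sketch: streaming token shards through an assumed BlobFuse2 mount so GPUs stay fed.
# The mount point and file layout are assumptions for illustration only.
import glob
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class TokenShardDataset(Dataset):
    def __init__(self, pattern="/mnt/training/shards/*.npy", seq_len=2048):
        self.files = sorted(glob.glob(pattern))   # blob objects appear as ordinary files
        self.seq_len = seq_len

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        tokens = np.load(self.files[idx], mmap_mode="r")   # read through the blob mount
        return torch.from_numpy(np.array(tokens[: self.seq_len], dtype=np.int64))

# Background workers prefetch upcoming shards while the GPUs train on the current batch.
loader = DataLoader(TokenShardDataset(), batch_size=8, num_workers=8,
                    pin_memory=True, prefetch_factor=4)
```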

Automatic scaling dynamically allocates resources as demand grows. Combined with advanced security, resiliency and cost-effective tiered storage, Azure's storage platform sets the pace for next-generation workloads, delivering the performance, scalability and reliability required.

AI WAN: Connecting multiple datacenters for an even larger AI supercomputer

These new AI datacenters are part of a global network of Azure AI datacenters, interconnected via our Wide Area Network (WAN). This isn't just about one building; it's about a distributed, resilient and scalable system that operates as a single, powerful AI machine. Our AI WAN is built with growth capacity at AI-native bandwidth scales to enable large-scale distributed training across multiple, geographically diverse Azure regions, allowing customers to harness the power of a giant AI supercomputer.

This is a fundamental shift in how we think about AI supercomputers. Instead of being limited by the walls of a single facility, we're building a distributed system where compute, storage and networking resources are seamlessly pooled and orchestrated across datacenter regions. This means greater resiliency, scalability and flexibility for customers.

Bringing it all together

To meet the critical needs of the largest AI challenges, we needed to redesign every layer of our cloud infrastructure stack. This isn't just about isolated breakthroughs, but about composing multiple new approaches across silicon, servers, networks and datacenters, leading to advancements where software and hardware are optimized as one purpose-built system.

Microsoft's Wisconsin datacenter will play a critical role in the future of AI, built on real technology, real investment and real community impact. As we connect this facility with other regional datacenters, and as every layer of our infrastructure is harmonized as a complete system, we're unleashing a new era of cloud-powered intelligence, secure, adaptive and ready for what's next.

To learn more about Microsoft's datacenter innovations, check out the virtual datacenter tour at datacenters.microsoft.com.

Scott Guthrie is responsible for hyperscale cloud computing solutions and services including Azure, Microsoft's cloud computing platform, generative AI solutions, data platforms, and information and cybersecurity. These platforms and services help organizations worldwide solve urgent challenges and drive long-term transformation.
