Accelerating open-source infrastructure development for frontier AI at scale

6 months ago 93

Microsoft is contributing caller standards crossed power, cooling, sustainability, security, networking, and fleet resiliency to beforehand innovation.

In the modulation from gathering computing infrastructure for unreality standard to gathering unreality and AI infrastructure for frontier scale, the satellite of computing has experienced tectonic shifts successful innovation. Throughout this journey, Microsoft has shared its learnings and champion practices, optimizing our unreality infrastructure stack successful cross-industry forums specified arsenic the Open Compute Project (OCP) Global Foundation.

Today, we spot that the adjacent signifier of unreality infrastructure innovation is poised to beryllium the astir consequential play of translation yet. In conscionable the past year, Microsoft has added much than 2 gigawatts of caller capableness and launched the world’s astir almighty AI datacenter, which delivers 10x the show of the world’s fastest supercomputer today. Yet, this is conscionable the beginning.

Delivering AI infrastructure astatine the highest show and lowest outgo requires a systems approach, with optimizations crossed the stack to thrust quality, speed, and resiliency astatine a level that tin supply a accordant acquisition to our customers. In the quest to proviso resilient, sustainable, secure, and wide scalable exertion to grip the breadth of AI workloads, we’re embarking connected an ambitious caller journey: 1 not conscionable of redefining infrastructure innovation astatine each furniture of execution from silicon to systems, but 1 of tightly integrated manufacture alignment connected standards that connection a exemplary for planetary interoperability and standardization.

At this year’s OCP Global Summit, Microsoft is contributing caller standards crossed power, cooling, sustainability, security, networking, and fleet resiliency to further beforehand innovation successful the industry.

Redefining powerfulness organisation for the AI era

As AI workloads standard globally, hyperscale datacenters are experiencing unprecedented powerfulness density and organisation challenges.

Last year, astatine the OCP Global Summit, we partnered with Meta and Google successful the improvement of Mt. Diablo, a disaggregated powerfulness architecture. This year, we’re building connected this innovation with the adjacent measurement of our full-stack transformation of datacenter powerfulness systems: solid-state transformers. Solid-state transformers simplify the powerfulness concatenation with caller conversion technologies and extortion schemes that tin accommodate aboriginal rack voltage requirements.

Training ample models crossed thousands of GPUs besides introduces adaptable and aggravated powerfulness gully patterns that tin strain the grid. The utility, and accepted powerfulness transportation systems. These fluctuations not lone hazard hardware reliability and operational ratio but besides make challenges crossed capableness readying and sustainability goals.

Together with cardinal manufacture partners, Microsoft is starring a powerfulness stabilization inaugural to code this challenge. In a precocious published insubstantial with OpenAI and NVIDIA—Power Stabilization for AI Training Datacenters—we code however full-stack innovations spanning rack-level hardware, firmware orchestration, predictive telemetry, and installation integration tin creaseless powerfulness spikes, trim powerfulness overshoot by 40%, and mitigate operational hazard and costs to alteration predictable, and scalable powerfulness transportation for AI grooming clusters.

This year, astatine the OCP Global Summit, Microsoft is joining forces with manufacture partners to motorboat a dedicated powerfulness stabilization workgroup. Our extremity is to foster unfastened collaboration crossed hyperscalers and hardware partners, sharing our learnings from full-stack innovation and inviting the assemblage to co-develop caller methodologies that code the unsocial powerfulness challenges of AI grooming datacenters. By gathering connected the insights from our precocious published achromatic paper, we purpose to accelerate industry-wide adoption of resilient, scalable powerfulness transportation solutions for the adjacent procreation of AI infrastructure. Read much astir our powerfulness stabilization efforts.

Cooling innovations for resiliency

As the powerfulness illustration for AI infrastructure changes, we are besides continuing to rearchitect our cooling infrastructure to enactment evolving needs astir vigor consumption, abstraction optimization, and wide datacenter sustainability. Various cooling solutions indispensable beryllium implemented to enactment the standard of our expansion—as we question to physique caller AI-scale datacenters, we are besides utilizing Heat Exchanger Unit (HXU)-based liquid cooling to rapidly deploy caller AI capableness wrong our existing air-cooled datacenter footprint.

Microsoft’s adjacent procreation HXU is an upcoming OCP publication that enables liquid cooling for high-performance AI systems successful air-cooled datacenters, supporting planetary scalability and accelerated deployment. The modular HXU plan delivers 2X the show of existent models and maintains >99.9% cooling work availability for AI workloads. No datacenter modifications are required, allowing seamless integration and expansion. Learn much astir the adjacent procreation HXU here. 

Meanwhile, we’re continuing to innovate crossed aggregate layers of the stack to code changes successful powerfulness and vigor dissipation—utilizing installation h2o cooling astatine datacenter-scale, circulating liquid in closed-loops from server to chiller; and exploring on-chip cooling innovations similar microfluidics to efficiently region vigor straight from the silicon.

Unified networking solutions for increasing infrastructure demands 

Scaling hundreds of thousands of GPUs to run arsenic a single, coherent strategy comes with important challenges to make rack-scale interconnects that tin present low-latency, precocious bandwidth fabrics that are some businesslike and interoperable. As AI workloads turn exponentially and infrastructure demands intensify, we are exploring networking optimizations that tin enactment these needs. To that end, we person developed solutions leveraging scale-up, scale-out, and Wide Area Network (WAN) solutions to alteration large-scale distributed training.

We spouse intimately with standards bodies, similar UEC (Ultra Ethernet Consortium) and UALink, focused connected innovation successful networking technologies for this captious constituent of AI systems. We are besides driving guardant adoption of Ethernet for scale-up networking crossed the ecosystem and are excited to spot the Ethernet for Scale-up Networking (ESUN) workstream motorboat nether the OCP Networking Project. We look guardant to promoting adoption of cutting-edge networking solutions and enabling multi-vendor Ecosystem based connected unfastened standards.

Security, sustainability, and quality: Fundamental pillars for resilient AI operations

Defense successful depth: Trust astatine each layer

Our broad attack to scaling AI systems responsibly includes embedding spot and information into each furniture of our platform. This year, we are introducing caller information contributions that physique connected our existing assemblage of enactment successful hardware information and present caller protocols that are uniquely acceptable to enactment caller technological breakthroughs that person been accelerated with the instauration of AI:

  • Building connected past years’ contributions and Microsoft’s collaboration with AMD, Google, and NVIDIA, we person further enhanced Caliptra, our open-source silicon basal of spot The instauration of Caliptra 2.1 extends the hardware root-of-trust to a afloat information subsystem. Learn much astir Caliptra 2.1 here.
  • We person besides added Adams Bridge 2.0 to Caliptra to widen enactment for quantum-resilient cryptographic algorithms to the root-of-trust.
  • Finally, we are contributing OCP Layered Open-source Cryptographic Key Management (L.O.C.K)—a cardinal absorption artifact for retention devices that secures media encryption keys successful hardware. L.O.C.K was developed done collaboration betwixt Google, Kioxia, Microsoft, Samsung, and Solidigm.

Advancing datacenter-scale sustainability 

Sustainability continues to beryllium a large country of accidental for manufacture collaboration and standardization done communities specified arsenic the Open Compute Project. Working collaboratively arsenic an ecosystem of hyperscalers and hardware partners is 1 catalyst to code the request for sustainable datacenter infrastructure that tin efficaciously standard arsenic compute demands proceed to evolve. This year, we are pleased to proceed our collaborations arsenic portion of OCP’s Sustainability workgroup crossed areas specified arsenic c reporting, accounting, and circularity:

  • Announced astatine this year’s Global Summit, we are partnering with AWS, Google, and Meta to money the Product Category Rule inaugural nether the OCP Sustainability workgroup, with the extremity of standardizing c measurement methodology for devices and datacenter equipment.
  • Together with Google, Meta, OCP, Schneider Electric, and the iMasons Climate Accord, we are establishing the Embodied Carbon Disclosure Base Specification to found a communal model for reporting the c interaction of datacenter equipment.
  • Microsoft is advancing the adoption of discarded vigor reuse (WHR). In concern with the NetZero Innovation Hub, NREL, and EU and US collaborators, Microsoft has published heat reuse notation designs and is processing an economical modeling instrumentality which supply information halfway operators and discarded vigor disconnected takers/consumers the outgo it takes to make the discarded vigor reuse infrastructure based connected the conditions similar the size and capableness of the WHR system, season, location, WHR mandates and subsidies successful place. These region-specific solutions assistance operators person excess vigor into usable energy—meeting regulatory requirements and unlocking caller capacity, particularly successful regions similar Europe wherever vigor reuse is becoming mandatory.
  • We person developed an unfastened methodology for Life Cycle Assessment (LCA) astatine standard crossed large-scale IT hardware fleets to thrust towards a “gold standard” successful sustainable unreality infrastructure.

Rethinking node management: Fleet operational resiliency for the frontier era

As AI infrastructure scales astatine an unprecedented pace, Microsoft is investing successful standardizing however divers compute nodes are deployed, updated, monitored, and serviced crossed hyperscale datacenters. In collaboration with AMD, Arm, Google, Intel, Meta, and NVIDIA, we are driving a bid of Open Compute Project (OCP) contributions focused connected streamlining fleet operations, unifying firmware management, manageability interfaces and enhancing diagnostics, debug, and RAS (Reliability, Availability, and Serviceability) capabilities. This standardized attack to lifecycle absorption lays the instauration for consistent, scalable node operations during this play of accelerated expansion. Read much astir our attack to resilient fleet operations

Paving the mode for frontier-scale AI computing 

As we participate a caller epoch of frontier-scale AI development, Microsoft takes pridefulness successful starring the advancement of standards that volition thrust the aboriginal of globally deployable AI supercomputing. Our committedness is reflected successful our progressive relation in shaping the ecosystem that enables scalable, secure, and reliable AI infrastructure crossed the globe. We invitation attendees of this year’s OCP Global Summit to link with Microsoft astatine booth #B53 to observe our latest unreality hardware demonstrations. These demonstrations showcase our ongoing collaborations with partners passim the OCP community, highlighting innovations that enactment the improvement of AI and unreality technologies.

Connect with Microsoft astatine the OCP Global Summit 2025 and beyond

Read Entire Article