Building and serving models on your own infrastructure is a strong use case for businesses. Google Cloud lets you design your AI infrastructure to suit your workloads. Recently, I experimented with Google Kubernetes Engine (GKE) managed DRANET while deploying a model for inference on NVIDIA B200 GPUs on GKE. In this blog, we will walk through this setup in easy-to-follow steps.
What is DRANET?
Dynamic Resource Allocation (DRA) is a Kubernetes feature that lets you request and share resources among Pods. DRANET builds on DRA to request and allocate networking resources for your Pods, including network interfaces that support TPUs and Remote Direct Memory Access (RDMA). In my case, that means RDMA for high-end GPUs.
How GPU RDMA VPC works
The RDMA network is set up as an isolated VPC, which is regional and assigned a network profile type; in this case, the network profile type is RoCEv2. This VPC is dedicated to GPU-to-GPU communication. The GPU VM families have RDMA-capable NICs that connect to the RDMA VPC, and the GPUs on multiple nodes communicate over this low-latency, high-bandwidth, rail-aligned fabric.
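To see which network profiles (including the RoCEv2 profiles used by RDMA VPCs) are available to your project, you can list them. The exact command surface is an assumption based on the gcloud beta track at the time of writing; check the current gcloud reference:

```shell
# List available network profiles; RoCEv2 profiles are zonal
# (e.g. named after a zone). Command availability may depend on
# your gcloud version and release track.
gcloud beta compute network-profiles list
```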
Design pattern example
Our aim was to deploy an LLM (DeepSeek) onto a GKE cluster with A4 nodes, each with eight NVIDIA B200 GPUs, and serve it privately via GKE Inference Gateway. To set up an AI Hypercomputer GKE cluster you can use the Cluster Toolkit, but in my case I wanted to test GKE managed DRANET's dynamic setup of the networking that supports RDMA for GPU communication.
This design utilizes the following services to provide an end-to-end solution:
- VPC: Three VPCs in total. One is created manually; GKE managed DRANET creates the other two automatically, one standard and one for RDMA.
- GKE: To deploy the workload.
- GKE Inference Gateway: To expose the workload internally using a regional internal Application Load Balancer (type gke-l7-rilb).
- A4 VMs: These support RoCEv2 with NVIDIA B200 GPUs.
Putting it together
To get access to the A4 VMs, a future reservation was used; reservations are tied to a specific zone.
Begin: Set up the environment
- Create a standard VPC, with firewall rules and a subnet in the region of the reservation's zone.
- Create a proxy-only subnet; this will be used by the internal regional Application Load Balancer attached to the GKE Inference Gateway.
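As a sketch of these two steps (the names my-vpc, my-subnet, the region, and the IP ranges are placeholders, not from the original setup):

```shell
# Standard VPC for the cluster's primary interfaces.
gcloud compute networks create my-vpc --subnet-mode=custom

# Subnet in the same region as the A4 reservation's zone.
gcloud compute networks subnets create my-subnet \
  --network=my-vpc --region=europe-west1 --range=10.0.0.0/24

# Allow internal traffic within the VPC.
gcloud compute firewall-rules create my-vpc-allow-internal \
  --network=my-vpc --allow=tcp,udp,icmp --source-ranges=10.0.0.0/8

# Proxy-only subnet for the regional internal Application Load Balancer
# behind the GKE Inference Gateway.
gcloud compute networks subnets create my-proxy-subnet \
  --network=my-vpc --region=europe-west1 --range=10.129.0.0/23 \
  --purpose=REGIONAL_MANAGED_PROXY --role=ACTIVE
```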
Next: Create a standard GKE cluster with a default node pool.
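A minimal sketch of the cluster creation (cluster name, network names, and region are placeholders; DRANET support may also require a sufficiently recent GKE version, so check the docs for version flags):

```shell
# Regional cluster on the manually created VPC. --gateway-api enables
# the Gateway API controller needed later for the Inference Gateway.
gcloud container clusters create my-cluster \
  --region=europe-west1 \
  --network=my-vpc --subnetwork=my-subnet \
  --gateway-api=standard \
  --num-nodes=1
```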
Once that is complete you can connect to your cluster:
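For example (same placeholder names as above):

```shell
# Fetch credentials so kubectl targets the new cluster.
gcloud container clusters get-credentials my-cluster --region=europe-west1
```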
Create a GPU node pool (this example uses A4 VMs with a reservation) with these additional flags:
- --accelerator-network-profile=auto (GKE automatically adds the gke.networks.io/accelerator-network-profile: auto label to the nodes)
- --node-labels=cloud.google.com/gke-networking-dra-driver=true (enables DRA for high-performance networking)
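A sketch of the node pool command; the pool name, zone, and reservation name are placeholders, and the A4 machine type shown is an assumption to verify against the current machine-family docs:

```shell
gcloud container node-pools create a4-pool \
  --cluster=my-cluster --region=europe-west1 \
  --node-locations=europe-west1-b \
  --machine-type=a4-highgpu-8g \
  --accelerator=type=nvidia-b200,count=8 \
  --reservation-affinity=specific --reservation=my-a4-reservation \
  --accelerator-network-profile=auto \
  --node-labels=cloud.google.com/gke-networking-dra-driver=true \
  --num-nodes=1
```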
Next: Create a ResourceClaimTemplate, which will be used to attach the networking resources to your deployments. The deviceClassName: mrdma.google.com is used for GPU workloads:
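A minimal sketch of the template; the DRA API version (resource.k8s.io/v1beta1) and the request/template names are assumptions, so check which version your cluster serves:

```shell
kubectl apply -f - <<EOF
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaimTemplate
metadata:
  name: mrdma-template
spec:
  spec:
    devices:
      requests:
      - name: mrdma-nics
        deviceClassName: mrdma.google.com  # RDMA NICs for GPU workloads
        allocationMode: All                # claim all matching NICs on the node
EOF
```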
Deploy model and inference
Now that a cluster and node pool are set up, we can deploy a model and serve it via the Inference Gateway. In my experiment I used DeepSeek, but this could be any model.
Deploy model and services
- The nodeSelector gke.networks.io/accelerator-network-profile: auto is used to schedule the Pods onto the GPU nodes
- The resourceClaims section attaches the networking resource we defined
Create a secret (I used Hugging Face token):
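For example (secret and key names are my own choices; substitute your token):

```shell
# Replace <HF_TOKEN> with your Hugging Face access token.
kubectl create secret generic hf-secret \
  --from-literal=hf_api_token=<HF_TOKEN>
```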
Deployment
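The Deployment ties the pieces above together. This is a sketch: the vLLM serving image, model ID, port, and claim/template names are assumptions, not the exact manifest from my experiment:

```shell
kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-vllm
  template:
    metadata:
      labels:
        app: deepseek-vllm
    spec:
      # Schedule onto the DRANET-enabled GPU nodes.
      nodeSelector:
        gke.networks.io/accelerator-network-profile: auto
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest            # assumed serving image
        args: ["--model", "deepseek-ai/DeepSeek-R1"]  # assumed model ID
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-secret
              key: hf_api_token
        resources:
          limits:
            nvidia.com/gpu: 8
          # Reference the Pod-level resource claim from the container.
          claims:
          - name: mrdma
      # Attach the networking resources defined by the template.
      resourceClaims:
      - name: mrdma
        resourceClaimTemplateName: mrdma-template
EOF
```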
Create Inference pool
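A sketch of the InferencePool; the API group/version and the endpoint-picker extension name are assumptions, so check the GKE Inference Gateway docs for the current schema:

```shell
kubectl apply -f - <<EOF
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: deepseek-pool
spec:
  # Select the model-server Pods created above.
  selector:
    app: deepseek-vllm
  targetPortNumber: 8000
  # Endpoint-picker extension that routes requests within the pool.
  extensionRef:
    name: deepseek-endpoint-picker
EOF
```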
Create the Gateway, HTTPRoute and InferenceObjective
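A sketch of these three resources; the gke-l7-rilb GatewayClass comes from the design above, while the InferenceObjective schema shown is an assumption to verify against the current Inference Gateway docs:

```shell
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: inference-gateway
spec:
  gatewayClassName: gke-l7-rilb   # regional internal Application Load Balancer
  listeners:
  - name: http
    port: 80
    protocol: HTTP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: deepseek-route
spec:
  parentRefs:
  - name: inference-gateway
  rules:
  - backendRefs:
    # Route to the InferencePool instead of a regular Service.
    - name: deepseek-pool
      group: inference.networking.x-k8s.io
      kind: InferencePool
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceObjective
metadata:
  name: deepseek-objective
spec:
  priority: 1
  poolRef:
    name: deepseek-pool
EOF
```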
Once complete, you can create a test VM in your main VPC and make a call to the IP address of the GKE Inference Gateway:
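For example, from the test VM (the request path and payload assume an OpenAI-compatible vLLM server; the Gateway name matches the sketch above):

```shell
# Look up the internal address assigned to the Gateway.
GATEWAY_IP=$(kubectl get gateway inference-gateway \
  -o jsonpath='{.status.addresses[0].value}')

# Send an OpenAI-style completion request through the Inference Gateway.
curl http://${GATEWAY_IP}/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-R1", "prompt": "Hello", "max_tokens": 32}'
```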