Modern generative AI applications require deployment closer to where data is generated and business decisions are made, but this creates new infrastructure challenges. Organizations in manufacturing, healthcare, finance, and telecommunications need to deliver low-latency, energy-efficient AI workloads at the edge while maintaining data locality and regulatory compliance. However, managing Kubernetes on-premises adds operational complexity that can slow down innovation.
You can use Amazon Elastic Kubernetes Service (Amazon EKS) Hybrid Nodes to address this by joining on-premises infrastructure to the Amazon EKS control plane as remote nodes. This allows you to accelerate AI workload deployment with consistent operational practices, while addressing latency, compliance, and data residency requirements. EKS Hybrid Nodes removes the complexity and burden of self-managing Kubernetes on-premises so that your team can focus on deploying AI applications and driving innovation. It provides unified workflows and tooling alongside centralized monitoring and enhanced observability across your distributed infrastructure.
EKS Hybrid Nodes enables you to deliver AI capabilities wherever your business demands, such as the following use cases:
- Run low-latency services at on-premises locations, including real-time inference at the edge
- Train models with data that must remain on-premises to meet regulatory compliance requirements
- Deploy inference workloads near source data, such as Retrieval-Augmented Generation (RAG) applications using a local knowledge base
- Repurpose existing hardware investment
This post demonstrates a real-world example of integrating EKS Hybrid Nodes with NVIDIA DGX Spark, a compact and energy-efficient GPU platform optimized for edge AI deployment. We walk you through deploying a large language model (LLM) for low-latency generative AI inference on-premises, and setting up node monitoring and GPU observability with centralized management through Amazon EKS. Although this post uses DGX Spark, the architecture and patterns discussed apply to other NVIDIA DGX systems or GPU platforms.
Solution overview
For this demo walkthrough, you create an EKS cluster with EKS Hybrid Nodes enabled, and connect an on-premises DGX Spark as a hybrid node. You install the NVIDIA GPU Operator for Kubernetes to provision GPU resources for the local generative AI inference. Then, you deploy an LLM on the hybrid nodes using NVIDIA NIM, a set of microservices optimized by NVIDIA for accelerated model deployment. You also set up the Amazon EKS Node Monitoring Agent (NMA) to monitor node health and detect GPU-specific issues. Finally, you integrate the NVIDIA Data Center GPU Manager (DCGM) Exporter with Amazon Managed Service for Prometheus and Amazon Managed Grafana to provide GPU metrics observability across hybrid nodes.
The following diagram presents a high-level overview of the architecture of our solution.
Figure 1: Hybrid architecture for deploying GenAI workloads on-premises or at the edge using Amazon EKS Hybrid Nodes with NVIDIA DGX
EKS Hybrid Nodes requires private network connectivity between your on-premises or edge environment and the AWS Region. This connectivity can be established using either AWS Direct Connect or AWS Site-to-Site VPN into your Amazon Virtual Private Cloud (Amazon VPC). The node and pod Classless Inter-Domain Routing (CIDR) blocks for your hybrid nodes and container workloads must be unique and routable across your network environment. You provide these CIDRs as the RemoteNodeNetwork and RemotePodNetwork values when creating the EKS cluster with hybrid nodes.
This walkthrough doesn’t cover the hybrid networking prerequisites for EKS Hybrid Nodes. Go to the Amazon EKS user guide for details.
Prerequisites
The following prerequisites are necessary to complete this solution:
- Amazon VPC with two private and two public subnets, across two Availability Zones (AZs).
- An EKS cluster with hybrid nodes enabled. Follow the Amazon EKS user guide to deploy.
- On-premises compute nodes running a compatible operating system.
- Private connectivity between the on-premises network and Amazon VPC (through VPN or Direct Connect).
- Two routable RFC-1918 or CGNAT CIDR blocks for RemoteNodeNetwork and RemotePodNetwork.
- On-premises firewall rules and EKS cluster security groups configured to allow bi-directional communication between the Amazon EKS control plane and the remote node and pod CIDRs, as per the networking prerequisites.
- NVIDIA DGX (or other GPU-enabled) systems as hybrid nodes.
- An NVIDIA NGC account and API key for accessing NIM microservices (see the NVIDIA documentation).
- The following tools installed: AWS CLI, kubectl, and Helm.
Walkthrough
The following steps walk you through this solution.
Prepare EKS Hybrid Nodes
The following three sections walk you through preparations for EKS Hybrid Nodes.
Prepare IAM credentials
- Amazon EKS Hybrid Nodes use temporary AWS Identity and Access Management (IAM) credentials provisioned by AWS Systems Manager hybrid activations or IAM Roles Anywhere to authenticate with the EKS cluster. Follow the Amazon EKS user guide to create the required Hybrid Nodes IAM role (AmazonEKSHybridNodesRole) using either option.
- Create an Amazon EKS access entry with the Hybrid Nodes IAM role to enable your on-premises nodes to join the cluster. Go to Prepare cluster access for hybrid nodes in the Amazon EKS user guide for more details.
Install nodeadm and join the DGX Spark as hybrid node
- Use the EKS Hybrid Nodes CLI (nodeadm) to bootstrap and install all required components for your hybrid nodes to join the EKS cluster. This demo uses the ARM64 version of nodeadm for the DGX Spark.
- Prepare a nodeConfig.yaml configuration file using the temporary IAM credentials generated in the previous section. The following is an example for using Systems Manager hybrid activations for hybrid nodes credentials.
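Based on the NodeConfig schema in the Amazon EKS user guide, a minimal nodeConfig.yaml for Systems Manager hybrid activations looks like the following sketch; the cluster name, Region, and activation values are placeholders to replace with your own:

```yaml
apiVersion: node.eks.aws/v1alpha1
kind: NodeConfig
spec:
  cluster:
    name: my-eks-cluster          # replace with your EKS cluster name
    region: us-west-2             # replace with your cluster's AWS Region
  hybrid:
    ssm:
      activationCode: <SSM hybrid activation code>
      activationId: <SSM hybrid activation ID>
```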
- Run the nodeadm init command with your nodeConfig.yaml to join your hybrid nodes to the EKS cluster.
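The following is a sketch of the install and init sequence; the Kubernetes version shown (1.31) is an example and should match your EKS cluster version:

```bash
# Install kubelet, containerd, and related components for the target
# Kubernetes version, using Systems Manager as the credential provider
sudo ./nodeadm install 1.31 --credential-provider ssm

# Bootstrap the node and join it to the EKS cluster
sudo ./nodeadm init -c file://nodeConfig.yaml
```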
- For mixed GPU and non-GPU hybrid nodes, we recommend that you add a --register-with-taints=nvidia.com/gpu=Exists:NoSchedule taint to GPU nodes to maximize GPU resource usage. Refer to the documentation regarding how to modify the kubelet configuration using nodeadm.
Install Cilium Container Network Interface (CNI)
- Before running workloads on hybrid nodes, you must install a compatible CNI. For this example, we use Cilium because it’s the AWS-supported CNI for EKS Hybrid Nodes.
Create a Cilium configuration file: cilium-values.yaml.
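The following sketch follows the Cilium configuration pattern documented in the Amazon EKS user guide for hybrid nodes; the pod CIDR is a placeholder for your RemotePodNetwork value, and the BGP setting supports the peering configured later in this walkthrough:

```yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: eks.amazonaws.com/compute-type
              operator: In
              values:
                - hybrid
ipam:
  mode: cluster-pool
  operator:
    clusterPoolIPv4PodCIDRList:
      - 10.200.0.0/16            # replace with your RemotePodNetwork CIDR
operator:
  replicas: 1
  unmanagedPodWatcher:
    restart: false
bgpControlPlane:
  enabled: true                  # enables Cilium's BGP control plane
```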
- Install Cilium on EKS Hybrid Nodes using Helm with the preceding configuration.
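For example (pin the chart to a Cilium version supported for EKS Hybrid Nodes per the Amazon EKS user guide):

```bash
helm repo add cilium https://helm.cilium.io/
helm repo update
helm install cilium cilium/cilium \
  --namespace kube-system \
  --values cilium-values.yaml
```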
- If you’re running webhooks on hybrid nodes, then you must make sure that on-premises pod CIDRs are routable across the hybrid network environment, using techniques such as BGP routing, static routing, or ARP proxying. This demo uses the Cilium BGP control plane to enable BGP peering between hybrid nodes and on-premises routers, and to advertise pod CIDRs to the on-premises network.
Apply the following Cilium BGP configuration to your cluster.
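The following is an illustrative configuration using Cilium's BGPv2 resources; the ASNs, peer address, and resource names are placeholder values for your environment:

```yaml
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp
spec:
  nodeSelector:
    matchLabels:
      eks.amazonaws.com/compute-type: hybrid
  bgpInstances:
    - name: instance-65432
      localASN: 65432                  # placeholder ASN for the hybrid nodes
      peers:
        - name: onprem-router
          peerASN: 65000               # placeholder ASN of your on-premises router
          peerAddress: 192.168.10.1    # placeholder router IP
          peerConfigRef:
            name: cilium-peer
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer
spec:
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:
          advertise: bgp
---
apiVersion: cilium.io/v2alpha1
kind: CiliumBGPAdvertisement
metadata:
  name: pod-cidr-advert
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: PodCIDR       # advertise pod CIDRs to the on-premises network
```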
- Validate that your nodes are connected to the EKS cluster and in a Ready state.
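For example:

```bash
kubectl get nodes
# With Systems Manager credentials, the hybrid node appears under its
# managed-instance name (mi-...) and should report STATUS "Ready"
# once Cilium is up.
```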
Install NVIDIA GPU Operator for Kubernetes
The NVIDIA GPU Operator uses the Kubernetes operator framework to automate the lifecycle management of NVIDIA software components required to provision GPU resources. These components include the NVIDIA drivers (for enabling CUDA), the Kubernetes device plugin for GPUs, the NVIDIA Container Toolkit, and DCGM-based monitoring, among others.
- Deploy NVIDIA GPU Operator on hybrid nodes using the official Helm chart.
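For example, the following sketch assumes the GPU driver is already present on the node (as on DGX OS); drop the driver flag on systems where the operator should install the driver itself:

```bash
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false    # driver is preinstalled on DGX OS
```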
- Wait until all pods in the gpu-operator namespace are running or completed.
- The NVIDIA GPU Operator validates the stack using the nvidia-operator-validator and the nvidia-cuda-validator pods. Verify the logs on these pods and confirm that the validations are successful.
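For example:

```bash
# Watch until the operator pods are Running or Completed
kubectl get pods -n gpu-operator

# Inspect the validator pods; a successful run reports that
# all validations are successful
kubectl logs -n gpu-operator -l app=nvidia-operator-validator
kubectl logs -n gpu-operator -l app=nvidia-cuda-validator
```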
- The GPU within the DGX Spark node is now exposed to the kubelet and is visible in nodes allocatable:
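For example (the node name is a placeholder, and the exact allocatable values depend on your hardware):

```bash
kubectl get node <node-name> -o jsonpath='{.status.allocatable}'
# Expect an "nvidia.com/gpu" entry alongside cpu and memory,
# such as: {"cpu":"20","memory":"...","nvidia.com/gpu":"1",...}
```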
Deploy NVIDIA NIM for inference on EKS Hybrid Nodes
- To deploy NVIDIA NIM, you must set up an NVIDIA NGC API key and create container registry secrets using the key.
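For example, the following sketch uses the secret names (ngc-secret and ngc-api) conventionally used in NVIDIA's NIM Helm chart documentation, and assumes your key is exported as NGC_API_KEY:

```bash
kubectl create namespace nim

# Registry secret for pulling NIM container images from nvcr.io
kubectl create secret docker-registry ngc-secret \
  --namespace nim \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

# Generic secret the NIM pod uses to download model artifacts from NGC
kubectl create secret generic ngc-api \
  --namespace nim \
  --from-literal=NGC_API_KEY="$NGC_API_KEY"
```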
- Download the NIM Helm chart using the following command:
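The chart version shown here is an example; check NGC for the current nim-llm chart release:

```bash
helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-1.3.0.tgz \
  --username='$oauthtoken' --password="$NGC_API_KEY"
```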
- Select a supported model for NVIDIA NIM based on the GPU specification of your hybrid nodes. Create the helm charts overrides using the NIM container image path, and set the ngcAPISecret and imagePullSecrets using the secrets created in Step 1.
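The following is a sketch of the overrides file (nim-values.yaml); the image repository path is hypothetical and must be replaced with the NGC path for your chosen model, and the toleration matches the GPU taint suggested earlier:

```yaml
image:
  repository: nvcr.io/nim/qwen/qwen3-32b   # hypothetical path; use your model's NGC image path
  tag: latest
imagePullSecrets:
  - name: ngc-secret                       # registry secret created in Step 1
model:
  ngcAPISecret: ngc-api                    # NGC API key secret created in Step 1
resources:
  limits:
    nvidia.com/gpu: 1
tolerations:                               # needed if GPU nodes were tainted during nodeadm init
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
```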
- Deploy a NIM-based LLM using the following command. In this example, I’m running a Qwen3-32B image that is specifically optimized for the DGX Spark node.
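For example, using the chart archive downloaded earlier:

```bash
helm install nim-llm ./nim-llm-1.3.0.tgz \
  --namespace nim \
  --values nim-values.yaml
```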
This deployment isn’t persistent and doesn’t use a model cache. To implement a model cache, you need to install CSI drivers and configure Persistent Volumes using the on-premises storage infrastructure.
- The NIM pod deployed on hybrid nodes is routable through BGP, so you can directly access its API to test the model.
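NIM serves an OpenAI-compatible API on port 8000. The following sketch assumes the pod carries the app.kubernetes.io/name=nim-llm label set by the chart; adjust the selector and model name if yours differ:

```bash
# Get the NIM pod IP (routable from the on-premises network through BGP)
POD_IP=$(kubectl get pod -n nim -l app.kubernetes.io/name=nim-llm \
  -o jsonpath='{.items[0].status.podIP}')

curl -s "http://${POD_IP}:8000/v1/chat/completions" \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "qwen/qwen3-32b",
        "messages": [{"role": "user", "content": "What is EKS Hybrid Nodes?"}],
        "max_tokens": 128
      }'
```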
The following is an example of expected response:
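The response follows the OpenAI chat completions format; the following is abbreviated and illustrative, and the generated content will vary:

```json
{
  "id": "chat-...",
  "object": "chat.completion",
  "model": "qwen/qwen3-32b",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Amazon EKS Hybrid Nodes lets you join on-premises ..."
      },
      "finish_reason": "stop"
    }
  ],
  "usage": { "prompt_tokens": 12, "completion_tokens": 98, "total_tokens": 110 }
}
```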
You have successfully deployed an LLM using NVIDIA NIM on your EKS Hybrid Nodes.
Configure centralized monitoring and observability for GPU metrics
The following two sections walk you through configuring centralized monitoring and observability for GPU metrics.
Install EKS Node Monitoring Agent
The EKS Node Monitoring Agent (NMA) is bundled into a container image that can be deployed as a DaemonSet across your EKS Hybrid Nodes. It collects node health information and detects GPU-specific issues using the NVIDIA DCGM and NVIDIA Management Library (NVML). It reports health issues by updating node status conditions and emitting Kubernetes events. Go to this AWS Containers post to learn more about NMA.
- To install the NMA on hybrid nodes, use the following AWS CLI command to create the Amazon EKS add-on.
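For example (the cluster name is a placeholder):

```bash
aws eks create-addon \
  --cluster-name my-eks-cluster \
  --addon-name eks-node-monitoring-agent
```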
- When it’s installed, NMA starts collecting custom node conditions for the EKS Hybrid Nodes. In the following example, you can see that NMA detected that the 200 GbE clustering interface (enp1s0f0np0) of the hybrid node is disconnected, because I’m only using a single DGX Spark.
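For example (the output is illustrative; exact condition names and messages depend on the NMA version):

```bash
kubectl describe node <node-name>
# Conditions:
#   Type               Status   ...   Message
#   NetworkingReady    False    ...   interface enp1s0f0np0 is down
```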
- NMA also provides an automated log collection method through a Kubernetes custom resource called NodeDiagnostic. To enable log collection from your hybrid nodes, create a NodeDiagnostic custom resource on your cluster; refer to the Amazon EKS user guide for more details.
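A minimal NodeDiagnostic resource, based on the schema in the Amazon EKS user guide, looks like the following; the resource name must match the node you want to collect logs from, and the destination is a presigned S3 upload URL you generate:

```yaml
apiVersion: eks.amazonaws.com/v1alpha1
kind: NodeDiagnostic
metadata:
  name: <node-name>                       # must match the target node's name
spec:
  logCapture:
    destination: <presigned S3 upload URL>
```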
Integrate NVIDIA DCGM Exporter with Amazon Managed Service for Prometheus and Amazon Managed Grafana
Beyond node health monitoring, you can use the NVIDIA DCGM Exporter (within the GPU Operator stack) to gather GPU performance metrics and telemetry data that can be scraped by Prometheus. This section shows how to integrate DCGM Exporter with Amazon Managed Service for Prometheus and Amazon Managed Grafana to enable enhanced GPU observability across your EKS Hybrid Nodes.
- Start by creating an Amazon Managed Service for Prometheus workspace.
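For example (the alias is an arbitrary placeholder):

```bash
aws amp create-workspace --alias dgx-gpu-metrics
# Note the workspaceId in the response; you need it for the
# remote write URL later in this section.
```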
- Next, follow this user guide to create an IAM role that allows Prometheus to ingest the scraped GPU metrics from EKS Hybrid Nodes to the managed workspace. Verify that the role has the following permissions attached.
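The ingestion role needs the Amazon Managed Service for Prometheus remote write and metadata permissions, as in the following policy:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "aps:RemoteWrite",
        "aps:GetSeries",
        "aps:GetLabels",
        "aps:GetMetricMetadata"
      ],
      "Resource": "*"
    }
  ]
}
```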
- Prepare a Prometheus installation Helm values file as the following example. Provide the Prometheus ingestion role Amazon Resource Name (ARN) from the last step, update the remoteWrite endpoint path with the managed Prometheus workspace URL, and add the DCGM Exporter scrape configurations.
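The following sketch targets the prometheus-community chart; the role ARN, Region, and workspace ID are placeholders, and the scrape configuration discovers the DCGM Exporter endpoints in the gpu-operator namespace:

```yaml
serviceAccounts:
  server:
    name: amp-iamproxy-ingest-service-account
    annotations:
      # IRSA annotation; replace with your ingestion role ARN
      eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/amp-ingest-role
server:
  remoteWrite:
    - url: https://aps-workspaces.us-west-2.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write
      sigv4:
        region: us-west-2
extraScrapeConfigs: |
  - job_name: gpu-metrics
    scrape_interval: 5s
    metrics_path: /metrics
    scheme: http
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names:
            - gpu-operator
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_node_name]
        action: replace
        target_label: kubernetes_node
```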
- Use Helm to deploy Prometheus to hybrid nodes using the preceding values. Prometheus uses DCGM Exporter to scrape GPU performance metrics and remote write to the Amazon Managed Service for Prometheus workspace.
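For example:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/prometheus \
  --namespace prometheus --create-namespace \
  --values prometheus-values.yaml
```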
- Follow this guide to create an Amazon Managed Grafana workspace, including the necessary permissions and authentication access through the IAM Identity Center. Then, configure the Grafana workspace to add Amazon Managed Service for Prometheus as a data source.
- Finally, create a new Grafana dashboard (or import one like this) to visualize scraped GPU metrics such as GPU utilization, GPU memory used, and GPU temperature and energy consumption.
Figure 2: Use Amazon Managed Grafana to monitor and visualize GPU metrics and telemetry across hybrid nodes
You can integrate EKS Hybrid Nodes with AWS cloud services to streamline generative AI deployment on-premises by removing the Kubernetes management overhead, while maintaining consistent operational practices with centralized observability across cloud, on-premises, and edge locations.
Cleaning up
To avoid incurring long-term charges, delete the AWS resources created as part of the demo walkthrough.
Clean up other prerequisite resources that you created if they’re no longer needed.
Conclusion
This post provides a practical example of how Amazon EKS Hybrid Nodes empowers generative AI deployment using your own GPU nodes at on-premises and edge locations. Organizations can use EKS Hybrid Nodes to accelerate AI implementation with data locality and minimal latency, while maintaining consistent management and centralized observability across distributed environments.
To learn more about EKS Hybrid Nodes or running AI/ML workloads on Amazon EKS, explore the following resources:
- EKS Hybrid Nodes user guide
- AWS Blog: A deep dive into Amazon EKS Hybrid Nodes
- AWS re:Invent 2024 session (KUB205) – Bring the power of Amazon EKS to your on-premises applications
- AWS AI on EKS project

