Implementing an assurance pipeline for the Amazon EKS platform


Organizations using Amazon Elastic Kubernetes Service (Amazon EKS) need to establish that their clusters are built as designed, production-ready, and follow Amazon EKS Best Practices. Although Amazon EKS manages the Kubernetes control plane, validating cluster configurations and establishing quality across infrastructure, applications, policies, and resilience remains a key responsibility for platform teams. This post details how platform engineering teams can build an assurance pipeline for Amazon EKS deployments, incorporating validation frameworks that verify configurations, test infrastructure as code (IaC), assess application resilience, and establish compliance with organizational standards.

This comprehensive validation approach complements the robust scalability capabilities of Amazon EKS, helping teams build confidence in their deployments and maintain high-quality Kubernetes environments that can handle the demands of large-scale operations.

Current pain points in validating EKS clusters

Organizations deploying applications on Amazon EKS face several validation challenges:

  • Infrastructure validation gaps: Traditional testing often focuses on application code and neglects IaC validation, leading to misconfigurations and deployment failures.
  • Siloed testing approaches: Teams often use disconnected testing methods across infrastructure, applications, and policies, creating blind spots in validation coverage.
  • Limited policy enforcement testing: Organizations struggle to validate that their Kubernetes policies are correctly enforced, potentially exposing security vulnerabilities.
  • Non-functional testing complexity: Load testing Kubernetes components such as CoreDNS requires specialized knowledge and tools that many teams lack.
  • Resilience assessment challenges: Understanding how applications behave during infrastructure failures is difficult without thorough failure simulation and the frameworks and tools to support it.
  • Manual and time-consuming processes: Without automated validation frameworks, teams resort to manual validation, which is error-prone, limited in scope, and often leads to inefficient practices.

Solution overview

To address cluster validation challenges, we’ve developed an assurance pipeline that systematically validates Amazon EKS environments through six distinct frameworks, each serving a specific purpose in our validation process.

  1. Infrastructure validation (Terraform test): Validates infrastructure before deployment by testing EKS cluster component modules and verifying compliance with Amazon Web Services (AWS) best practices. This early validation process helps detect and resolve infrastructure issues during the development phase rather than in production.
  2. Behavioral testing (Pytest BDD): Validates cluster behavior through readable test scenarios that verify core operations such as pod scheduling and service discovery. The framework establishes proper component interactions and confirms that Kubernetes API operations respond as expected.
  3. Package validation (Helm testing): Verifies Helm chart installations and cluster add-ons deployment while establishing proper resource creation. This validation step maintains consistency as code moves between different environments.
  4. Policy compliance (Chainsaw): Tests admission controls, security policies, and network policies to establish that clusters adhere to organizational standards and compliance requirements. This comprehensive policy validation safeguards cluster security configurations.
  5. Performance assessment (Locust): Evaluates cluster performance under various load conditions by measuring component response times and monitoring scaling behavior. This testing helps identify potential performance bottlenecks before they impact production workloads.
  6. Resilience testing (AWS Tools): Uses AWS Resilience Hub and AWS Fault Injection Service (AWS FIS) to test failure recovery procedures and validate availability configurations. These tools help identify reliability improvements and establish robust cluster operations.

This pipeline gives us a clear view of our Amazon EKS environments, helping us catch issues before they affect our applications. Each framework adds a layer of validation, creating a practical approach to testing our Kubernetes infrastructure.

Prerequisites

The following prerequisites are necessary before continuing:

Furthermore, navigate to your GitLab project and configure the following:

  1. Go to Settings > CI/CD > Variables.
  2. Add the following variables:
    • AWS_ACCESS_KEY_ID: Your AWS access key
    • AWS_SECRET_ACCESS_KEY: Your AWS secret key
    • AWS_REGION: Your preferred AWS Region
    • CLUSTER_NAME: Your EKS cluster name

Walkthrough

In this walkthrough, you integrate the Amazon EKS validation framework in the GitLab CI/CD pipeline. Create a .gitlab-ci.yml file in your repository root with the following structure:

stages:
  - validate-infrastructure
  - deploy-infrastructure
  - validate-policies
  - deploy-applications
  - functional-tests
  - non-functional-tests
  - resilience-assessment

variables:
  AWS_REGION: us-west-2
  CLUSTER_NAME: eks-validation-cluster
  TERRAFORM_DIR: terraform
  HELM_DIR: helm
  POLICY_DIR: policies
  FUNCTIONAL_TEST_DIR: tests/functional
  LOAD_TEST_DIR: tests/load

# Reusable templates
.aws-auth: &aws-auth
  before_script:
    # Using GitLab CI/CD environment variables for AWS credentials
    # These should be set as protected and masked variables in GitLab CI/CD settings
    # No need to explicitly configure credentials as AWS CLI will automatically use these variables
    - export AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID
    - export AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY
    - export AWS_DEFAULT_REGION=$AWS_REGION
    - aws sts get-caller-identity # Verify AWS credentials are working

.k8s-auth: &k8s-auth
  before_script:
    - aws eks update-kubeconfig --name $CLUSTER_NAME --region $AWS_REGION

# Infrastructure validation and deployment
terraform-validate:
  stage: validate-infrastructure
  image: hashicorp/terraform:latest
  <<: *aws-auth
  script:
    - cd $TERRAFORM_DIR
    - terraform init
    - terraform validate
    - terraform test

terraform-plan:
  stage: validate-infrastructure
  image: hashicorp/terraform:latest
  <<: *aws-auth
  script:
    - cd $TERRAFORM_DIR
    - terraform init
    - terraform plan -out=tfplan
  artifacts:
    paths:
      - $TERRAFORM_DIR/tfplan
    expire_in: 1 day

terraform-apply:
  stage: deploy-infrastructure
  image: hashicorp/terraform:latest
  <<: *aws-auth
  script:
    - cd $TERRAFORM_DIR
    - terraform init
    - terraform apply -auto-approve tfplan
  dependencies:
    - terraform-plan
  when: manual
  environment:
    name: production
    url: https://console.aws.amazon.com/eks/home?region=$AWS_REGION#/clusters/$CLUSTER_NAME

# Policy validation
policy-test:
  stage: validate-policies
  image: ghcr.io/kyverno/chainsaw:latest
  <<: *k8s-auth
  script:
    - cd $POLICY_DIR
    - chainsaw test --report-format junit --report-path chainsaw-report.xml
  artifacts:
    reports:
      junit: $POLICY_DIR/chainsaw-report.xml

# Application deployment and Helm testing
helm-deploy-test:
  stage: deploy-applications
  image: alpine/helm:latest
  <<: *k8s-auth
  script:
    - cd $HELM_DIR
    - helm dependency update ./
    - helm upgrade --install app-release ./ --wait
    - helm test app-release --logs

# Functional testing
functional-test:
  stage: functional-tests
  image: python:3.9
  <<: *k8s-auth
  script:
    - cd $FUNCTIONAL_TEST_DIR
    - pip install -r requirements.txt
    - pytest --bdd-format=pretty --junitxml=pytest-report.xml
  artifacts:
    reports:
      junit: $FUNCTIONAL_TEST_DIR/pytest-report.xml

# Non-functional testing
load-test:
  stage: non-functional-tests
  image: locustio/locust:latest
  <<: *k8s-auth
  script:
    - cd $LOAD_TEST_DIR
    - locust -f coredns_locustfile.py --headless -u 20 -r 2 -t 5m --html=locust-report.html
  artifacts:
    paths:
      - $LOAD_TEST_DIR/locust-report.html
    expire_in: 1 week

# Resilience assessment
resilience-assessment:
  stage: resilience-assessment
  image: amazon/aws-cli:latest
  <<: *aws-auth
  script:
    - cd resilience
    - ./run-resilience-assessment.sh $CLUSTER_NAME $AWS_REGION
    - ./run-fault-injection.sh $CLUSTER_NAME $AWS_REGION
  artifacts:
    paths:
      - resilience/assessment-report.json
      - resilience/fis-results.json
    expire_in: 1 week

1. Unit testing with Terraform test

Unit testing your infrastructure code is crucial for catching configuration errors early in the development cycle. It helps:

  • Validate that your infrastructure components are correctly defined
  • Establish that resources have the expected properties and configurations
  • Prevent costly mistakes before deploying to AWS
  • Provide documentation of expected infrastructure behavior
  • Enable refactoring with confidence

How to implement

To implement unit testing with Terraform’s native testing framework, you can use your existing Terraform repository and create a tests directory with an eks.tftest.hcl file. The eks.tftest.hcl file is a Terraform test configuration file used for unit testing Amazon EKS infrastructure code. It validates that your Amazon EKS infrastructure components are correctly defined before actual deployment to AWS, helping catch configuration errors early in the development cycle. In your existing Terraform project structure, create the following tests directory:

your-terraform-project/
├── main.tf
├── eks.tf
├── ...
├── karpenter.tf
├── variables.tf
├── outputs.tf
└── tests/
    └── eks.tftest.hcl
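The contents of eks.tftest.hcl depend on your module layout; the following is an illustrative sketch only, in which the variable names and module outputs (var.eks_cluster_version, module.eks, the karpenter node group) are assumptions based on the assertions exercised in this post and should be adapted to your configuration:

```hcl
# tests/eks.tftest.hcl -- illustrative sketch; adjust variable names and
# module outputs to match your own Terraform configuration
variables {
  eks_cluster_version = "1.30"
  vpc_cidr            = "10.0.0.0/16"
}

run "create_eks_cluster" {
  # `command = plan` validates assertions without creating real AWS resources
  command = plan

  assert {
    condition     = module.eks.cluster_name != ""
    error_message = "EKS cluster name should not be empty"
  }

  assert {
    condition     = module.eks.cluster_version == var.eks_cluster_version
    error_message = "EKS cluster version should match the specified version"
  }

  assert {
    condition     = contains(keys(module.eks.eks_managed_node_groups), "karpenter")
    error_message = "A karpenter managed node group should exist"
  }
}
```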

Sample run and expected output

When running the Terraform tests with the new native testing framework, the output looks like the following:

$ cd terraform
$ terraform test
Testing terraform/tests/eks.tftest.hcl...
  run "create_eks_cluster"... pass

Success! 1 passed, 0 failed.

For more detailed output with verbose flag:

$ terraform test -verbose
Testing terraform/tests/eks.tftest.hcl...
  run "create_eks_cluster"...
    module.eks.cluster_name != ""... pass
    module.eks.cluster_version == var.eks_cluster_version... pass
    length(module.eks.eks_managed_node_groups) == 1... pass
    contains(keys(module.eks.eks_managed_node_groups), "karpenter")... pass
    module.karpenter.node_iam_role_name == local.name... pass
    helm_release.karpenter.namespace == "kube-system"... pass
    helm_release.karpenter.chart == "karpenter"... pass
    helm_release.karpenter.version == "0.37.0"... pass
    module.vpc.name == local.name... pass
    module.vpc.vpc_cidr_block == var.vpc_cidr... pass
    length(module.vpc.private_subnets) == length(local.azs)... pass
    length(module.vpc.public_subnets) == length(local.azs)... pass
    length(module.vpc.intra_subnets) == length(local.azs)... pass
  pass

Success! 1 passed, 0 failed.

If there are any failures, the detailed error message looks like the following:

$ terraform test
Testing terraform/tests/eks.tftest.hcl...
  run "create_eks_cluster"...
    module.eks.cluster_version == var.eks_cluster_version... fail
      EKS cluster version should match the specified version
      module.eks.cluster_version is "1.29"
      var.eks_cluster_version is "1.30"
  fail

Error: 1 test failed.

These outputs provide comprehensive validation that your infrastructure code is correctly defined and will create the expected resources when deployed.

2. Functional testing with Pytest BDD

Functional testing validates that your EKS cluster behaves as expected from an operational perspective. It’s essential because:

  • It verifies that critical Kubernetes components are running correctly
  • It establishes that cluster services are accessible and responding properly
  • It validates that the cluster can perform its intended functions
  • It catches integration issues that unit tests might miss
  • It provides confidence that the cluster works for end users

How to implement

Create a tests/functional directory with your BDD tests:

tests/functional/
├── requirements.txt
├── conftest.py
├── features/
│   └── cluster_validation.feature
└── steps/
    └── cluster_steps.py

Example requirements.txt (specifies the Python package dependencies needed to run the functional tests and establish consistent test environments across different systems):

pytest
pytest-bdd
kubernetes
boto3

Example cluster_validation.feature (behavior specifications written in Gherkin syntax, defines test scenarios in plain, human-readable language):

Feature: EKS Cluster Validation

  Scenario: Verify critical components are running
    Given an EKS cluster is available
    When I check the kube-system namespace
    Then all critical pods should be in Running state

  Scenario: Check logs for errors
    Given an EKS cluster is available
    When I check pods in the kube-system namespace
    Then logs should not contain any errors

Example cluster_steps.py (the actual Python implementation of the test steps defined in the feature file):

from pytest_bdd import given, when, then, parsers
from kubernetes import client, config
import boto3


@given("an EKS cluster is available")
def eks_cluster(request):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    return v1


@when(parsers.parse("I check the {namespace} namespace"))
def check_namespace(eks_cluster, namespace):
    return eks_cluster.list_namespaced_pod(namespace)


@then("all critical pods should be in Running state")
def check_pods_running(check_namespace):
    for pod in check_namespace.items:
        assert pod.status.phase == "Running", f"Pod {pod.metadata.name} is not running"


@when(parsers.parse("I check pods in the {namespace} namespace"))
def check_pods_logs(eks_cluster, namespace):
    pods = eks_cluster.list_namespaced_pod(namespace)
    logs = {}
    for pod in pods.items:
        try:
            logs[pod.metadata.name] = eks_cluster.read_namespaced_pod_log(
                name=pod.metadata.name, namespace=namespace
            )
        except Exception:
            logs[pod.metadata.name] = ""
    return logs


@then("logs should not contain any errors")
def check_logs_for_errors(check_pods_logs):
    error_keywords = ["error", "exception", "fail", "critical"]
    for pod_name, log in check_pods_logs.items():
        for keyword in error_keywords:
            assert keyword.lower() not in log.lower(), f"Error found in {pod_name} logs"

Sample run and expected output

When running the functional tests, the output looks like the following:

$ cd tests/functional
$ pytest --bdd-format=pretty --junitxml=pytest-report.xml
============================= test session starts ==============================
platform linux -- Python 3.9.7, pytest-7.3.1, pluggy-1.0.0
rootdir: /repo/tests/functional
plugins: bdd-6.1.1
collected 2 items

Feature: EKS Cluster Validation # features/cluster_validation.feature:1
    Scenario: Verify critical components are running # features/cluster_validation.feature:3
        Given an EKS cluster is available # steps/cluster_steps.py:6
        When I check the kube-system namespace # steps/cluster_steps.py:12
        Then all critical pods should be in Running state # steps/cluster_steps.py:16

    Scenario: Check logs for errors # features/cluster_validation.feature:8
        Given an EKS cluster is available # steps/cluster_steps.py:6
        When I check pods in the kube-system namespace # steps/cluster_steps.py:20
        Then logs should not contain any errors # steps/cluster_steps.py:32

============================= 2 passed in 8.32s ===============================

The JUnit XML report (pytest-report.xml) contains structured test results like the following:

<?xml version="1.0" encoding="utf-8"?>
<testsuites>
  <testsuite name="features.cluster_validation" errors="0" failures="0" skipped="0" tests="2" time="8.320" timestamp="2025-06-04T10:20:15">
    <testcase classname="features.cluster_validation" name="Verify critical components are running" time="4.123">
    </testcase>
    <testcase classname="features.cluster_validation" name="Check logs for errors" time="4.197">
    </testcase>
  </testsuite>
</testsuites>

This output demonstrates:

  • Successful execution of BDD scenarios
  • Verification that all critical pods are running
  • Confirmation that no errors are found in pod logs
  • Test timing information
  • Overall test summary showing all tests passed

The JUnit report can be integrated with CI/CD systems for reporting and tracking test results over time.
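As a sketch of that integration, the JUnit XML above can be parsed with Python's standard library to extract the pass/fail counts a CI dashboard would display. The XML string below is an abbreviated copy of the report shown above; no third-party dependencies are assumed:

```python
import xml.etree.ElementTree as ET

# JUnit XML as produced by pytest (abbreviated copy of the report above)
junit_xml = """<?xml version="1.0" encoding="utf-8"?>
<testsuites>
  <testsuite name="features.cluster_validation" errors="0" failures="0"
             skipped="0" tests="2" time="8.320" timestamp="2025-06-04T10:20:15">
    <testcase classname="features.cluster_validation"
              name="Verify critical components are running" time="4.123"/>
    <testcase classname="features.cluster_validation"
              name="Check logs for errors" time="4.197"/>
  </testsuite>
</testsuites>"""

root = ET.fromstring(junit_xml)
suite = root.find("testsuite")

# Aggregate counts the way a CI dashboard would
total = int(suite.get("tests"))
failures = int(suite.get("failures")) + int(suite.get("errors"))
passed = total - failures

print(f"{passed}/{total} tests passed")  # prints "2/2 tests passed"
```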

3. Helm testing

Helm testing establishes that your applications deploy correctly and function as expected within the Kubernetes environment. It’s important because:

  • It validates that your Helm charts are correctly structured
  • It establishes that deployed applications are accessible and functional
  • It verifies that services can communicate with each other
  • It catches configuration issues before they affect users
  • It provides a standardized way to test application deployments

How to implement

Create a helm directory with your Helm charts and tests:

helm/
├── Chart.yaml
├── values.yaml
├── templates/
│   └── ...
└── tests/
    ├── test-connection.yaml
    └── test-resources.yaml

Example test-connection.yaml. This is a Helm test manifest that creates a temporary Pod to verify that the application’s service is accessible within the Kubernetes cluster by running a wget command:

apiVersion: v1
kind: Pod
metadata:
  name: "{{ include "app.fullname" . }}-test-connection"
  labels:
    {{- include "app.labels" . | nindent 4 }}
  annotations:
    "helm.sh/hook": test
spec:
  containers:
    - name: wget
      image: busybox
      command: ['wget']
      args: ['{{ include "app.fullname" . }}:{{ .Values.service.port }}']
  restartPolicy: Never
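The tree above also lists a test-resources.yaml, which is not shown in this post. A minimal sketch might verify that the chart's resources were created by querying the API from a test Pod; the service account, image, and resource names here are assumptions, so adapt them to your chart and RBAC setup:

```yaml
# tests/test-resources.yaml -- illustrative sketch; the service account,
# image, and deployment name are assumptions, adapt them to your chart
apiVersion: v1
kind: Pod
metadata:
  name: "{{ include "app.fullname" . }}-test-resources"
  labels:
    {{- include "app.labels" . | nindent 4 }}
  annotations:
    "helm.sh/hook": test
spec:
  serviceAccountName: "{{ include "app.fullname" . }}"
  containers:
    - name: kubectl
      image: bitnami/kubectl
      command: ['kubectl']
      args: ['get', 'deployment', '{{ include "app.fullname" . }}']
  restartPolicy: Never
```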

Sample run and expected output

When running Helm tests, the output looks like the following:

$ helm test app-release --logs
NAME: app-release
LAST DEPLOYED: Wed Jun  4 10:15:22 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE:     app-release-test-connection
Last Started:   Wed Jun  4 10:16:05 2025
Last Completed: Wed Jun  4 10:16:15 2025
Phase:          Succeeded
NOTES:
Application successfully deployed and tested!

POD LOGS: app-release-test-connection
wget: download completed

This output confirms that:

  • The application Helm chart was successfully deployed
  • The test connection Pod reached the service and completed the wget request
  • All tests passed successfully

4. Kubernetes policy testing with Chainsaw

Policy testing establishes that your Kubernetes cluster enforces the security and compliance requirements that your organization needs. It’s critical because:

  • It validates that security policies are correctly implemented
  • It establishes that non-compliant resources are rejected
  • It verifies that your governance controls are working
  • It helps maintain compliance with industry standards and regulations
  • It prevents security vulnerabilities from being introduced

How to implement

Create a policies directory with your Kyverno policies and Chainsaw tests:

policies/
├── kyverno-policies/
│   ├── require-labels.yaml
│   └── restrict-image-registries.yaml
└── tests/
    ├── test-require-labels.yaml
    └── test-restrict-registries.yaml
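For reference, a require-labels policy of the kind exercised by these tests could look like the following Kyverno sketch, modeled on Kyverno's standard require-labels example; the specific label key (app.kubernetes.io/name) and the Deployment-only match scope are assumptions to adapt to your standards:

```yaml
# kyverno-policies/require-labels.yaml -- illustrative sketch based on
# Kyverno's standard require-labels example; adjust the label key and
# matched resource kinds as needed
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-for-labels
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "validation error: required labels are not set"
        pattern:
          metadata:
            labels:
              app.kubernetes.io/name: "?*"
```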

Example test-require-labels.yaml. This is a Chainsaw test manifest that validates Kubernetes label policy enforcement by applying a policy, testing a valid deployment passes, and confirming that an invalid deployment without the required labels is properly rejected:

apiVersion: chainsaw.kyverno.io/v1alpha1
kind: Test
metadata:
  name: test-require-labels
spec:
  steps:
    - name: step-01-apply-policy
      apply:
        file: ../kyverno-policies/require-labels.yaml
    - name: step-02-apply-valid-deployment
      apply:
        file: resources/valid-deployment.yaml
    - name: step-03-apply-invalid-deployment
      apply:
        file: resources/invalid-deployment.yaml
        expect:
          reject: true
          message: "validation error: required labels are not set"

Sample run and expected output

When running policy tests with Chainsaw, the output looks like the following:

$ chainsaw test --report-format junit
=== RUN   test-require-labels
=== RUN   test-require-labels/step-01-apply-policy
INFO[0000] ✅ Successfully applied resource  name=require-labels namespace=default resource=ClusterPolicy.kyverno.io/v1
=== RUN   test-require-labels/step-02-apply-valid-deployment
INFO[0001] ✅ Successfully applied resource  name=valid-deployment namespace=default resource=Deployment.apps/v1
=== RUN   test-require-labels/step-03-apply-invalid-deployment
INFO[0002] ✅ Resource rejected as expected  name=invalid-deployment namespace=default resource=Deployment.apps/v1
INFO[0002] ✅ Error message matched  expected="validation error: required labels are not set" received="admission webhook \"validate.kyverno.svc\" denied the request: resource Deployment/default/invalid-deployment was blocked due to the following policies: require-labels: validation error: required labels are not set"
--- PASS: test-require-labels (3.45s)
    --- PASS: test-require-labels/step-01-apply-policy (0.82s)
    --- PASS: test-require-labels/step-02-apply-valid-deployment (1.21s)
    --- PASS: test-require-labels/step-03-apply-invalid-deployment (1.42s)
PASS
=== RUN   test-restrict-registries
=== RUN   test-restrict-registries/step-01-apply-policy
INFO[0000] ✅ Successfully applied resource  name=restrict-image-registries namespace=default resource=ClusterPolicy.kyverno.io/v1
=== RUN   test-restrict-registries/step-02-apply-valid-deployment
INFO[0001] ✅ Successfully applied resource  name=valid-registry-deployment namespace=default resource=Deployment.apps/v1
=== RUN   test-restrict-registries/step-03-apply-invalid-deployment
INFO[0002] ✅ Resource rejected as expected  name=invalid-registry-deployment namespace=default resource=Deployment.apps/v1
INFO[0002] ✅ Error message matched  expected="validation error: image registry not allowed" received="admission webhook \"validate.kyverno.svc\" denied the request: resource Deployment/default/invalid-registry-deployment was blocked due to the following policies: restrict-image-registries: validation error: image registry not allowed"
--- PASS: test-restrict-registries (3.12s)
    --- PASS: test-restrict-registries/step-01-apply-policy (0.75s)
    --- PASS: test-restrict-registries/step-02-apply-valid-deployment (1.15s)
    --- PASS: test-restrict-registries/step-03-apply-invalid-deployment (1.22s)
PASS

Ran 2 test(s) in 6.57s
Tests succeeded: 2, Failed: 0

This output demonstrates:

  • Successful application of both Kyverno policies
  • Valid deployments are admitted, while non-compliant deployments are rejected by the admission webhook
  • The rejection messages match the expected policy validation errors
  • Overall test summary showing all tests passed

5. Non-functional testing with Locust

Non-functional testing evaluates the performance, scalability, and reliability of your EKS cluster under various conditions. It’s vital because:

  • It identifies performance bottlenecks before they impact users
  • It determines the maximum capacity of your cluster
  • It validates that your cluster can handle expected load
  • It helps optimize resource allocation and scaling configurations
  • It establishes that critical services remain responsive under stress

How to implement

Create a tests/load directory with your Locust tests:

tests/load/
├── coredns_locustfile.py
└── karpenter_locustfile.py

Example coredns_locustfile.py. This is a Locust load testing script that simulates DNS resolution stress on CoreDNS by dynamically creating Kubernetes services, querying their DNS records, and deleting them to measure DNS performance under load:

from locust import HttpUser, task, between, TaskSet
import kubernetes as k8s
import random
import string
import time

import dns.resolver

# Load Kubernetes configuration
k8s.config.load_kube_config()
v1 = k8s.client.CoreV1Api()

namespace_name = "locust-test"


def generate_service_name(length=10):
    return ''.join(random.choices(string.ascii_lowercase, k=length))


def create_service(name, namespace):
    service = k8s.client.V1Service(
        api_version="v1",
        kind="Service",
        metadata=k8s.client.V1ObjectMeta(name=name, namespace=namespace),
        spec=k8s.client.V1ServiceSpec(
            ports=[k8s.client.V1ServicePort(port=80, target_port=80)],
            selector={"app": name}
        )
    )
    return v1.create_namespaced_service(namespace=namespace, body=service)


def generate_and_create_services(namespace, count=5):
    service_names = []
    for _ in range(count):
        name = generate_service_name()
        create_service(name, namespace)
        service_names.append(name)
    return service_names


def query_coredns(service_names, namespace):
    resolver = dns.resolver.Resolver()
    resolver.nameservers = ['10.100.0.10']  # CoreDNS service IP
    for name in service_names:
        try:
            dns_name = f"{name}.{namespace}.svc.cluster.local"
            answers = resolver.resolve(dns_name, 'A')
            for rdata in answers:
                ip = rdata.address
        except Exception as e:
            print(f"DNS query failed: {e}")


def delete_services(service_names, namespace):
    for name in service_names:
        v1.delete_namespaced_service(name=name, namespace=namespace)


class CoreDNSUser(HttpUser):
    wait_time = between(1, 3)

    @task
    class CoreDNSTaskSet(TaskSet):
        @task
        def create_query_delete_services(self):
            # Create services
            service_names = generate_and_create_services(namespace_name, 5)
            # Query CoreDNS
            query_coredns(service_names, namespace_name)
            # Delete services
            delete_services(service_names, namespace_name)

Sample run and expected output

When running the Locust load tests, the output looks like the following:

$ locust -f coredns_locustfile.py --headless -u 20 -r 2 -t 5m
[2025-06-04 10:30:12,345] INFO/MainProcess: Starting Locust 2.15.1
[2025-06-04 10:30:12,352] INFO/MainProcess: Starting 20 users at a rate of 2 users/s
[2025-06-04 10:35:12,456] INFO/MainProcess: Test finished

Type     Name            # reqs   # fails  |    Avg     Min     Max    Med  |   req/s  failures/s
--------|---------------|--------|---------|-------|-------|-------|-------|--------|-----------
         Aggregated         542  0(0.00%)  |    345      78    1245    320  |     1.8        0.00

Response time percentiles (approximated)
Type     Name            50%    66%    75%    80%    90%    95%    98%    99%   99.9%  99.99%   100%  # reqs
--------|---------------|------|------|------|------|------|------|------|------|------|-------|------|------
         Aggregated       320    380    450    510    680    820    980   1100   1230    1240   1245    542

Test completed successfully.

This output demonstrates how your CoreDNS service performs under load, showing metrics like the following:

  • Average response time (345 ms)
  • Minimum and maximum response times (78 ms to 1245 ms)
  • Request throughput (1.8 requests per second)
  • Error rate (0% in this example)

You can use these metrics to identify potential bottlenecks and establish that your cluster can handle the expected load before deploying to production.
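To make acting on these numbers concrete, here is a small standalone Python sketch (independent of Locust) that computes percentiles from a list of response times and checks them against a latency budget, the way a pipeline gate might; the sample values and the 1000 ms 95th-percentile threshold are arbitrary illustrations, not numbers from the report above:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile of a list of response times (ms)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Example response times in milliseconds (illustrative values)
response_times = [78, 120, 210, 320, 320, 345, 380, 450, 680, 820, 1100, 1245]

p50 = percentile(response_times, 50)
p95 = percentile(response_times, 95)

# Gate the pipeline on a latency budget (threshold is an example value)
LATENCY_BUDGET_P95_MS = 1000
within_budget = p95 <= LATENCY_BUDGET_P95_MS
print(f"p50={p50}ms p95={p95}ms within_budget={within_budget}")
```

A real gate would read these samples from Locust's CSV or HTML output and fail the CI job when the budget is exceeded.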

6. Resilience assessment with Resilience Hub

Resilience assessment evaluates how well your EKS cluster can withstand and recover from failures. It’s essential because:

  • It identifies single points of failure in your architecture
  • It validates that your recovery mechanisms work as expected
  • It establishes business continuity during disruptions
  • It helps meet availability SLAs and compliance requirements
  • It provides confidence that your cluster can handle real-world incidents

How to implement

Create a resilience directory with scripts for resilience assessment:

resilience/
├── run-resilience-assessment.sh
└── run-fault-injection.sh

Example run-resilience-assessment.sh. This is a shell script that creates a Resilience Hub application for an EKS cluster, runs a resilience assessment to evaluate its disaster recovery capabilities, and saves the results to a JSON file:

#!/bin/bash
set -e

CLUSTER_NAME=$1
REGION=$2
APP_NAME="${CLUSTER_NAME}-app"

# Create Resilience Hub application if it doesn't exist
APP_ARN=$(aws resiliencehub list-apps --query "appSummaries[?name=='${APP_NAME}'].arn" --output text)

if [ -z "$APP_ARN" ]; then
  echo "Creating Resilience Hub application..."
  APP_ARN=$(aws resiliencehub create-app \
    --name "${APP_NAME}" \
    --description "EKS cluster resilience assessment" \
    --app-template-body "{\"resources\":[{\"logicalResourceId\":{\"identifier\":\"${CLUSTER_NAME}\"},\"resourceType\":\"AWS::EKS::Cluster\",\"type\":\"AWS::EKS::Cluster\"}]}" \
    --query "app.appArn" \
    --output text)
fi

# Run assessment
echo "Running resilience assessment..."
ASSESSMENT_ARN=$(aws resiliencehub start-app-assessment \
  --app-arn "${APP_ARN}" \
  --assessment-name "pipeline-assessment-$(date +%Y%m%d-%H%M%S)" \
  --query "assessment.assessmentArn" \
  --output text)

# Wait for assessment to complete
echo "Waiting for assessment to complete..."
aws resiliencehub wait assessment-executed --assessment-arn "${ASSESSMENT_ARN}"

# Get assessment results
echo "Getting assessment results..."
aws resiliencehub describe-app-assessment \
  --assessment-arn "${ASSESSMENT_ARN}" > assessment-report.json

echo "Assessment complete. Results saved to assessment-report.json"

Example run-fault-injection.sh. This is a shell script that creates and runs an AWS FIS experiment to test the resilience of an EKS cluster by simulating an availability zone outage and capturing the results:

#!/bin/bash
set -e

CLUSTER_NAME=$1
REGION=$2

# Create FIS experiment template
TEMPLATE_ID=$(aws fis create-experiment-template \
  --targets "eks-cluster={resourceType=aws:eks:cluster,resourceArns=[arn:aws:eks:${REGION}:$(aws sts get-caller-identity --query Account --output text):cluster/${CLUSTER_NAME}]}" \
  --actions "az-outage={actionId=aws:eks:inject-availability-zone-failure,targets={eks-cluster=eks-cluster},parameters={completionMode=forced}}" \
  --stop-conditions "duration={source=none,value=10m}" \
  --role-arn "arn:aws:iam::$(aws sts get-caller-identity --query Account --output text):role/FISExperimentRole" \
  --description "Test EKS cluster resilience to AZ failure" \
  --query "experimentTemplate.id" \
  --output text)

# Start FIS experiment
EXPERIMENT_ID=$(aws fis start-experiment \
  --experiment-template-id "${TEMPLATE_ID}" \
  --query "experiment.id" \
  --output text)

echo "Started FIS experiment ${EXPERIMENT_ID}"

# Wait for experiment to complete
echo "Waiting for experiment to complete..."
aws fis wait experiment-completed --id "${EXPERIMENT_ID}"

# Get experiment results
echo "Getting experiment results..."
aws fis get-experiment \
  --id "${EXPERIMENT_ID}" > fis-results.json

echo "Experiment complete. Results saved to fis-results.json"

Sample run and expected output

When running the resilience assessment scripts, the output looks like the following:

$ ./run-resilience-assessment.sh eks-validation-cluster us-west-2
Creating Resilience Hub application...
Running resilience assessment...
Waiting for assessment to complete...
Getting assessment results...
Assessment complete. Results saved to assessment-report.json

$ cat assessment-report.json
{
  "assessment": {
    "appArn": "arn:aws:resiliencehub:us-west-2:123456789012:app/eks-validation-cluster-app/1a2b3c4d",
    "assessmentArn": "arn:aws:resiliencehub:us-west-2:123456789012:app-assessment/5e6f7g8h",
    "assessmentName": "pipeline-assessment-20250604-103015",
    "assessmentStatus": "SUCCEEDED",
    "complianceStatus": "POLICY_COMPLIANT",
    "resiliencyScore": 85.0,
    "driftStatus": "NOT_DRIFTED",
    "invoker": "USER",
    "appVersion": "1",
    "assessmentTimeStamp": "2025-06-04T10:30:15.000Z"
  }
}

$ ./run-fault-injection.sh eks-validation-cluster us-west-2
Started FIS experiment fis-12345678abcdef01
Waiting for experiment to complete...
Getting experiment results...
Experiment complete. Results saved to fis-results.json

$ cat fis-results.json
{
  "experiment": {
    "id": "fis-12345678abcdef01",
    "experimentTemplateId": "fit-12345678abcdef01",
    "state": {
      "status": "COMPLETED",
      "reason": "Experiment completed successfully"
    },
    "targets": {
      "eks-cluster": {
        "resourceType": "aws:eks:cluster",
        "resourceArns": [
          "arn:aws:eks:us-west-2:123456789012:cluster/eks-validation-cluster"
        ]
      }
    },
    "actions": {
      "az-outage": {
        "actionId": "aws:eks:inject-availability-zone-failure",
        "state": {
          "status": "COMPLETED"
        },
        "startTime": "2025-06-04T10:35:00.000Z",
        "endTime": "2025-06-04T10:45:00.000Z"
      }
    },
    "startTime": "2025-06-04T10:35:00.000Z",
    "endTime": "2025-06-04T10:45:00.000Z"
  }
}

These outputs show a successful resilience assessment with a score of 85.0 and a completed fault injection experiment that simulated an Availability Zone failure. The assessment indicates that the cluster is policy compliant, and the fault injection experiment completed successfully, helping you identify how your cluster responds to failures.

Pipeline monitoring and visualization

You can view the pipeline execution in GitLab’s CI/CD interface, which provides a visual representation of each stage and its status. The pipeline generates reports and artifacts that can be reviewed to assess the quality of your Amazon EKS deployment. This implementation creates a complete quality assurance pipeline that validates all aspects of your EKS clusters throughout the development lifecycle.

Benefits of the Amazon EKS validation framework

Our Amazon EKS validation framework brings practical value to our Kubernetes operations through several key benefits. We test each part of our Amazon EKS setup to catch and fix issues before they reach production, leading to more stable services for our users. Our policy tests verify that security measures work as planned, giving us confidence in our cluster protection. Through load testing, we understand how our applications and infrastructure handle increased traffic, helping us prepare for busy periods and plan for growth. Tools such as Resilience Hub and AWS FIS show us how our system reacts to failures so that we can improve recovery plans and reduce potential downtime. Moreover, the automation in our framework cuts down manual testing time so that we can focus more on building new features and responding quickly to changes. Finding and fixing issues early in development also costs far less than addressing them in production. Our testing process establishes that our Amazon EKS environment meets both regulatory standards and internal rules, simplifying audits and reviews. The framework is a practical tool that helps us build and maintain reliable Kubernetes infrastructure that serves our needs today and supports our growth tomorrow.

Conclusion

A comprehensive quality assurance pipeline for Amazon EKS clusters helps establish that your Kubernetes environments are properly configured, secure, and ready for production workloads. You can implement the six validation components outlined in this post to verify correct infrastructure provisioning, establish the functional correctness and security of applications, test the environment’s ability to handle expected load, and validate recovery from failures. This structured approach to validation builds confidence in your Amazon EKS deployments, reduces the risk of production issues, and maintains high-quality, production-ready Kubernetes environments. As containerized applications become increasingly critical to business operations, investing in comprehensive validation frameworks is essential: it lets organizations maximize the benefits of Kubernetes while minimizing operational risks. You can adopt these validation frameworks to accelerate your journey toward reliable and performant Kubernetes deployments on Amazon EKS.


About the authors

Niall Thomson is a Principal Specialist Solutions Architect, Containers, at AWS where he helps customers who are building modern application platforms on AWS container services.

Ramesh Mathikumar is a Principal Consultant within the Global Financial Services practice. He has been working with Financial services customers over the last 25 years. At AWS, he helps customers succeed in their cloud journey by implementing AWS technologies every day.

Sundar Shanmugam is a Sr. Cloud Infrastructure Architect at AWS, specializing in solution architecture, workload migrations, and modernization. He focuses on developing innovative generative AI solutions and helps customers drive digital transformation while maximizing their AWS investment to achieve business objectives.
