Container Insights Overview

This architecture provides a scalable, automated way to manage IAM roles for Amazon EKS clusters across multiple AWS regions. It uses AWS Lambda to create, update, and delete the IAM roles associated with the service account used by the nOps Agent Stack. These roles give the agents the permissions they need to gather data and upload it to an Amazon S3 bucket for Container Insights. The roles also grant permission to subscribe to an SQS queue in the nOps AWS account, which enables real-time data fetching from the Kubernetes API for the Cluster Dashboard.

Key Features

  • Automated Role Management: Lambda functions dynamically manage IAM roles, ensuring each EKS cluster has the required permissions to interact with S3 and SQS.
  • Multi-Region Support: Easily extend the solution to EKS clusters across different AWS regions.
  • Flexibility: The CloudFormation template supports scenarios where an IAM user is required, making the solution adaptable to environments without an OIDC identity provider.

Deployment Workflow

  1. Deploy the CloudFormation Stack:

    • The stack automates the process of creating and managing IAM roles for EKS clusters, ensuring the proper roles are associated with the service account used by the nOps Agent Stack.

    Cloudformation

    note
    • Optional IAM User Support: The template can also handle situations where an IAM user might be needed, making it suitable for environments lacking OIDC identity providers.
    • Onboarding Confirmation: The Lambdas send a request to nOps to confirm CloudFormation onboarding.
  2. Install the nOps Agent Stack:

    • Once the stack is deployed, install the nOps Agent Stack in the cluster to begin data collection.

    • To install, navigate to Compute Copilot -> EKS, select your cluster, and go to the Configuration tab. Here, you can generate an API Key for the Helm upgrade command and copy the command to run in your cluster.

    • The command installs the Agent for Smart Compute Provisioning for Clusters with Karpenter, the Container Insights Agent Stack, and the Data Fetcher Agent.

    Helm Upgrade

    • To disable Smart Compute Provisioning for Clusters with Karpenter, add the following line to the command:
      --set karpenops.enabled=False
    • To disable the Data Fetcher Agent, add the following line:
      --set dataFetcher.enabled=False
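As a rough illustration of how the optional flags combine with the generated command, the sketch below appends them to a command string. The two flags are quoted from the list above; the base command shown is a hypothetical placeholder, since the real one (including the API key) is generated on the Configuration tab.

```python
# Flags quoted from the documentation above.
DISABLE_FLAGS = {
    "karpenops": "--set karpenops.enabled=False",
    "data_fetcher": "--set dataFetcher.enabled=False",
}


def with_disabled_components(base_command: str, *components: str) -> str:
    """Append the documented --set flag for each component to disable."""
    flags = [DISABLE_FLAGS[name] for name in components]
    return " ".join([base_command.strip(), *flags])


# Hypothetical placeholder; use the command copied from the Configuration tab.
base = "helm upgrade --install example-release example/chart"
print(with_disabled_components(base, "karpenops", "data_fetcher"))
```

Passing no component names returns the command unchanged, which matches the default behavior of installing all three agents.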

With this setup, your clusters are fully equipped to gather and upload data, providing comprehensive insights into your containerized workloads.

How the Container Insights Agent Works

Stack overview

note

Within the agent stack, Prometheus, OpenCost, and the nOps Container Insights Agent communicate over HTTPS, encrypted with TLS v1.2 or higher.

The agent operates through a series of cronjobs designed to collect and update data from Prometheus and OpenCost:

  • Hourly Cronjobs:

    • Two cronjobs collect metrics from your cluster and upload Parquet files to the S3 bucket.
  • Daily Cronjob: One cronjob runs daily to check the following components and update them if necessary:

    • Karpenops
    • OpenCost
    • Container Insights cronjobs
    • Heartbeat
    • Datafetcher
note

If your cluster includes NVIDIA GPU nodes, the dcgm-exporter DaemonSet is deployed on those nodes and exports GPU-related metrics as well.

How the Data Fetcher Agent Works

Stack overview

The Data Fetcher Agent uses the Two-Way Messaging Pattern over Amazon SQS for real-time communication with nOps. With this design, Kubernetes data is fetched on demand and shown in your Cluster Dashboard, so the view stays up to date.

note

Encryption: The Data Fetcher Agent communicates securely with Amazon SQS using HTTPS, ensuring encryption in transit. At this time, Server-Side Encryption (SSE) for SQS queues is not enabled.
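To make the messaging pattern more concrete, here is a minimal sketch of one common form of two-way messaging: request/reply with a correlation ID. It uses in-memory queues as stand-ins for SQS, and every name in it (the queues, the message fields, the fake cluster data) is an assumption for illustration, not the agent's actual implementation.

```python
import queue
import uuid

# In-memory stand-ins for SQS (assumption: one queue per direction).
request_queue: queue.Queue = queue.Queue()
response_queue: queue.Queue = queue.Queue()


def send_request(resource: str) -> str:
    """Platform side: ask the agent for data and return a correlation ID."""
    correlation_id = str(uuid.uuid4())
    request_queue.put({"id": correlation_id, "resource": resource})
    return correlation_id


def agent_poll_once(fetch) -> None:
    """Agent side: take one request, fetch the data, reply with the same ID."""
    message = request_queue.get_nowait()
    response_queue.put({"id": message["id"], "data": fetch(message["resource"])})


def receive_response(correlation_id: str):
    """Platform side: read one reply and check it answers our request."""
    reply = response_queue.get_nowait()
    if reply["id"] != correlation_id:
        raise ValueError("reply does not match the pending request")
    return reply["data"]


# Hypothetical data standing in for a Kubernetes API call.
fake_cluster = {"pods": ["api-7d9f", "worker-5c2b"]}
request_id = send_request("pods")
agent_poll_once(lambda resource: fake_cluster[resource])
print(receive_response(request_id))
```

The correlation ID is what lets the requesting side pair each reply with the request that caused it when several requests are in flight.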


CloudFormation Template Breakdown

1. Parameters

The template accepts the following input parameters:

  • IncludeRegions:

    • A comma-separated list of AWS regions where the solution should operate.
    • If left blank, the CloudFormation stack defaults to the region where it is being created.
  • RoleName:

    • The name of the IAM role to which the read policy will be attached.
    • This role is created during the onboarding process for each AWS account into nOps.
  • CreateIAMUser:

    • A boolean flag (true or false) that determines whether to create an IAM user.
    • This is helpful for environments that don’t have an IAM OIDC provider for authentication.
  • Environment:

    • Specifies the nOps environment in which the solution will operate.
    • Allowed Values: PROD, UAT.
    • Default: PROD.
  • Token:

    • The nOps API token required for authentication.
    • The nOps API token is used to make a request to nOps to confirm the CloudFormation onboarding process.
    • Sensitive information; it will not be displayed in logs or outputs.
  • AutoUpdate:

    • Indicates whether the stack should automatically update when a new version is released.
    • Allowed Values: true, false.
    • Default: true.
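The IncludeRegions fallback and the Environment allowed values can be sketched as follows. This is an illustrative model of the behavior described above, not code from the template; in particular, trimming whitespace around region names is an assumption.

```python
ALLOWED_ENVIRONMENTS = {"PROD", "UAT"}


def resolve_regions(include_regions: str, stack_region: str) -> list[str]:
    """Split the comma-separated list; fall back to the stack's own region."""
    # Assumption: surrounding whitespace and empty entries are ignored.
    regions = [part.strip() for part in include_regions.split(",") if part.strip()]
    return regions or [stack_region]


def check_environment(environment: str = "PROD") -> str:
    """Reject values outside the documented allowed set (default: PROD)."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"Environment must be one of {sorted(ALLOWED_ENVIRONMENTS)}")
    return environment


print(resolve_regions("us-east-1, eu-west-1", "us-west-2"))
print(resolve_regions("", "us-west-2"))
```

The first call returns the two listed regions; the second returns only the region where the stack is being created.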

2. Conditions

The template uses conditions to control resource creation based on the CreateIAMUser parameter:

  • CreateIAMUserCondition:

    • If CreateIAMUser is set to true, an IAM user will be created.
  • CreateIAMRoleCondition:

    • If CreateIAMUser is set to false, IAM roles will be created instead.
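The two conditions are mutually exclusive: exactly one of them holds for any valid CreateIAMUser value. A small sketch of that logic, written as plain Python rather than CloudFormation intrinsic functions:

```python
def evaluate_conditions(create_iam_user: str) -> dict[str, bool]:
    """Model the two template conditions for a given CreateIAMUser value."""
    if create_iam_user not in ("true", "false"):
        raise ValueError("CreateIAMUser must be 'true' or 'false'")
    return {
        "CreateIAMUserCondition": create_iam_user == "true",
        "CreateIAMRoleCondition": create_iam_user == "false",
    }


print(evaluate_conditions("false"))
```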

3. Resources

S3 Bucket

  • S3 Bucket:
    • Creates an S3 bucket named nops-container-cost suffixed with the account ID, with server-side encryption (AES-256) enabled.
    • This bucket is used to store data collected by the agent on each EKS cluster.
    • aws:SecureTransport is enforced through a Bucket Policy.
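For readers unfamiliar with aws:SecureTransport, the sketch below builds the commonly used bucket policy statement that denies any request not sent over HTTPS. It shows the general pattern only; the exact statement in the nOps template may differ, and the bucket name used here (a hyphen separator and a made-up account ID) is an assumption.

```python
import json


def secure_transport_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies all non-HTTPS requests."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# Hypothetical account ID; the separator before it is also an assumption.
print(json.dumps(secure_transport_policy("nops-container-cost-123456789012"), indent=2))
```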

IAM Policies and Roles

  • ReadPolicy:

    • An IAM policy granting read-only access (s3:GetObject, s3:ListBucket) to the S3 bucket.
    • This policy is attached to the IAM role specified by the RoleName parameter, which is created during the initial AWS account onboarding process.
  • IAMUser:

    • This resource is conditionally created based on the CreateIAMUserCondition.
    • If true, an IAM user named nops-container-cost-s3 is created. This supports AWS accounts that lack an OIDC provider for authentication.
    • If this user is created, AWS credentials for the user must be generated and used during agent installation, as detailed in the documentation.
  • NOPSCrossAccountRole:

    • A cross-account IAM role used by the auto-update mechanism.
    • This role includes the permissions required to manage IAM roles, policies, CloudFormation resources, and S3 interactions for auto-update operations.

Lambda Functions

  • NOPSContainerCostRoleCreationFunction:

    • A Lambda function responsible for creating and updating IAM roles for EKS clusters across multiple regions.
    • It is conditionally created if CreateIAMRoleCondition is true.
    • This function performs the following tasks:
      • Scans EKS clusters in the specified regions.
      • Creates or updates IAM roles associated with the OIDC identity provider for each EKS cluster.
      • Attaches an inline policy that grants interaction with the S3 bucket for each IAM role.
      • Attaches an inline policy that grants interaction with an SQS queue for each IAM role for real-time data fetching for your Cluster Dashboard.
      • A custom CloudFormation resource triggers this function upon stack creation, update, or deletion.
  • NOPSContainerCostRoleManagementFunction:

    • A secondary Lambda function that periodically checks the existence of roles and clusters.
    • It creates missing roles and cleans up orphaned ones.
    • This function is triggered every two hours by an Amazon EventBridge (formerly CloudWatch Events) rule.
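A role "associated with the OIDC identity provider" of a cluster is typically one whose trust policy lets a specific Kubernetes service account assume it through sts:AssumeRoleWithWebIdentity. The sketch below builds such a trust policy in the standard IAM roles for service accounts shape. It is not taken from the nOps Lambda code, and the account ID, issuer URL, namespace, and service account name are all hypothetical.

```python
import json


def oidc_trust_policy(account_id: str, issuer_url: str,
                      namespace: str, service_account: str) -> dict:
    """Build a trust policy scoped to one service account in one cluster."""
    # The provider ID is the issuer URL without its scheme.
    provider = issuer_url.removeprefix("https://")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                        f"{provider}:aud": "sts.amazonaws.com",
                    }
                },
            }
        ],
    }


# All four arguments are hypothetical example values.
policy = oidc_trust_policy(
    "123456789012",
    "https://oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE",
    "example-namespace",
    "example-service-account",
)
print(json.dumps(policy, indent=2))
```

Because each EKS cluster has its own OIDC issuer, a trust policy like this must be generated per cluster, which is why role creation is automated rather than done once by hand.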

Lambda Execution Role

  • NOPSContainerCostLambdaRole:
    • An IAM role that grants the Lambda functions permissions to manage IAM roles, interact with EKS clusters, and make STS (Security Token Service) calls.
    • This role includes the permissions required to manage roles and policies across multiple regions.

CloudWatch Event Rule

  • NOPSContainerCostScheduledCheckEventRule:
    • An EventBridge rule that triggers the NOPSContainerCostRoleManagementFunction every two hours.
    • This ensures that IAM roles are correctly aligned with the current state of EKS clusters.
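The periodic check amounts to a reconciliation between two sets: the clusters that currently exist and the roles that currently exist. A minimal sketch of that comparison follows, assuming one role per cluster and a hypothetical naming prefix; the real function's naming scheme and AWS API calls are not shown here.

```python
def plan_reconciliation(cluster_names: set[str], role_names: set[str],
                        prefix: str = "example-role-") -> dict[str, list[str]]:
    """Compare expected roles with existing ones and list the differences."""
    expected = {prefix + name for name in cluster_names}
    # Only consider roles that follow the (hypothetical) managed prefix.
    managed = {name for name in role_names if name.startswith(prefix)}
    return {
        "create": sorted(expected - managed),  # cluster exists, role missing
        "delete": sorted(managed - expected),  # role exists, cluster gone
    }


plan = plan_reconciliation(
    {"cluster-a", "cluster-b"},
    {"example-role-cluster-b", "example-role-retired", "unrelated-role"},
)
print(plan)
# {'create': ['example-role-cluster-a'], 'delete': ['example-role-retired']}
```

Filtering by prefix keeps the cleanup step from touching roles the stack does not own, such as unrelated-role in the example.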

4. Outputs

  • S3BucketSecureURL:

    • Outputs the secure HTTPS URL of the created S3 bucket.
  • BucketName:

    • Outputs the name of the S3 bucket created in the stack.

FAQ

  1. Where can I get information about Security and Compliance?
    You can find detailed reports on our stack and policies in this folder.

  2. Where can I find the CloudFormation template for inspection?
    You can find the template here.

  3. Do containers run as root?

    Most containers run as the nobody user. The exceptions are the Datadog agent, which is non-root but privileged, and the DCGM exporter, which runs as root.

  4. Which images are used in the deployment, and are they digitally signed?
    Yes, the following images are used in our full deployment, and all of them have been digitally signed in our public ECR:

    These images are hosted in our public ECR to mitigate rate limit issues and are securely signed using AWS Signer. This ensures the integrity and authenticity of the images, supporting a reliable and trusted deployment pipeline.

  5. What is the nops-data-fetcher agent?

    The nOps Data Fetcher Agent provides real-time Kubernetes cluster insights, seamlessly integrating with your nOps dashboard. With this agent, you can visualize live data from your cluster directly within the Workloads tab, conveniently located next to the Nodes tab. To facilitate communication with nOps, the agent leverages an Amazon SQS queue for efficient and reliable data transfer.