EKS Insights Dashboard
Getting Started with Compute Copilot's EKS Page: Metrics & Management
The EKS Clusters page in Compute Copilot gives you quick access to key efficiency metrics for all your EKS clusters. From this page, you can navigate to individual clusters to configure them on nOps for Compute Copilot and Business Context+ (BC+), view container utilization metrics, and access rightsizing recommendations. This guide explains how to use the EKS page.
What is Compute Copilot?
Compute Copilot for EKS is a powerful service designed to automatically optimize compute-based workloads, reducing AWS EKS costs by intelligently adjusting your cluster in three key dimensions:
- Price Efficiency
- Container Efficiency
- Node Efficiency
Price Efficiency
Compute Copilot maximizes cost savings by migrating workloads to Spot instances and ensuring that your EC2 commitments (Savings Plans & Reserved Instances) are fully utilized.
Container Efficiency
Compute Copilot enhances container efficiency through automated rightsizing. It analyzes workload patterns and provides CPU and memory request recommendations, which you can apply manually or have applied automatically.
Node Efficiency
By optimizing container sizing, Compute Copilot ensures that Karpenter can dynamically adjust your EKS nodes. This leads to better resource utilization, reducing unused capacity and improving cost efficiency.
Learn more about how Compute Copilot for EKS can help you to put your EKS cost optimization on auto-pilot here.
What is Business Context+ (BC+)?
BC+ helps you understand and allocate 100% of your AWS costs, from your largest resources down to individual container costs. Click here to learn more about BC+.
How to Access the EKS Page for Compute Copilot?
To access the EKS page, navigate to Compute Copilot → EKS from the nOps dashboard.
You can also see how to access the page in the screen recording below:
Metrics Available in the EKS Page
As shown in the screen recording above, the EKS page features various cards and charts displaying key information about your EKS clusters. This page aggregates metrics across all your clusters that are correctly configured.
Cost Breakdown, Cost Over Time, and Container Utilization metrics require Business Context+ (BC+).
Here's an explanation of each card:
Container Efficiency
Container Efficiency in Compute Copilot evaluates how effectively an organization utilizes its containerized resources within a cluster. It provides insights into the balance between actual usage and waste, helping users optimize container allocation to reduce excess capacity and improve cost efficiency.
How Container Efficiency is Calculated
The Container Efficiency Score is the ratio of actual container usage to total allocated container resources (usage plus waste), expressed as a percentage from 0 to 100. A higher score indicates more efficient resource utilization. An 80% cap is applied so that a reasonable level of excess capacity is maintained for operational flexibility.
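The calculation described above can be sketched in a few lines of Python. This is an illustration only, not the nOps implementation: the function name and parameters are hypothetical, and the way the 80% cap is applied is an assumption (here, a utilization of 80% or more is treated as a full score of 100).

```python
def container_efficiency_score(usage: float, waste: float, cap: float = 0.80) -> float:
    """Return a 0-100 score from usage and waste given in the same unit.

    Hypothetical helper: the exact formula used by Compute Copilot is not
    published in this guide.
    """
    total = usage + waste
    if total <= 0:
        # No allocated resources, so there is nothing to score.
        return 0.0
    utilization = usage / total
    # Assumption: utilization at or above the cap counts as fully efficient,
    # so the buffer above 80% is not penalized.
    return min(utilization / cap, 1.0) * 100.0


if __name__ == "__main__":
    # 40 units used, 60 units idle
    print(container_efficiency_score(usage=40, waste=60))
    # 90 units used, 10 units idle (above the cap)
    print(container_efficiency_score(usage=90, waste=10))
```

Under this reading, a cluster does not need to run at 100% utilization to reach the top score; anything beyond the cap is treated as healthy headroom.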
Breakdown of Metrics
- Estimated Opportunity: The total projected savings that could be achieved by optimizing container resource usage and reducing waste.
- Container Efficiency Score: A higher score indicates more efficient usage of containerized resources, with less wasted capacity and better alignment of resources to workload needs.
- Usage: The total resources actively used by containers within the cluster.
- Excess Capacity: The allocated container resources that are not being used (i.e., idle resources), representing wasted capacity.
- Total Cost: The total cost of the containerized resources, encompassing both used and unused capacity.
Node Efficiency
Node Efficiency in Compute Copilot evaluates how effectively an organization utilizes its allocated compute resources, highlighting potential areas for optimization. This metric provides insights into resource utilization by considering both the actual usage of nodes and the excess capacity that remains idle, helping users optimize infrastructure costs while maintaining desired performance levels.
How Node Efficiency is Calculated
The Node Efficiency Score is derived by analyzing the relationship between actual resource usage and available excess capacity, expressed as a score from 0 to 100. A higher score indicates a more efficient use of resources, with less wasted capacity, while a lower score reflects inefficiencies where allocated resources are underutilized.
The score is calculated based on the ratio of node usage relative to the total node resources, considering both usage and excess capacity. An 80% cap is applied to the efficiency score to account for the fact that a cluster should always have some level of excess capacity to maintain resilience and handle unexpected spikes in demand. This ensures the score doesn’t penalize reasonable buffer capacity that is required for maintaining cluster stability.
Breakdown of Metrics
- Estimated Opportunity: The projected savings that could be achieved by optimizing resource allocation and reducing idle capacity.
- Node Efficiency Score: A higher score indicates more efficient utilization of compute resources, with less unused capacity and better alignment of resources to workload needs.
- Usage: The actual consumption of allocated compute resources, such as CPU or memory, within the cluster.
- Excess Capacity: The compute resources that are allocated but not actively used, representing unused or wasted capacity.
- Overhead: The additional resource consumption due to system processes or management overhead that is not directly utilized by workloads.
- Total Cost: The total cost associated with the allocated resources, encompassing the expenses for both used and unused capacity.
Price Efficiency
Price Efficiency in Compute Copilot measures how well an organization optimizes cloud spending by balancing the use of Spot, On-Demand, and Commitment-based (Reserved Instances or Savings Plans) pricing models. It provides visibility into cost efficiency and helps users make informed decisions to reduce expenses while maintaining performance.
How Price Efficiency is Calculated
The Price Efficiency Score is determined by analyzing the proportion of costs across different pricing models. It is expressed as a score from 0 to 100, where a higher score indicates a more cost-efficient usage of cloud resources. The score is calculated based on the ratio of On-Demand costs relative to the total compute cost, with higher reliance on Spot and Commitment-based instances leading to a better score.
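As a rough illustration of the relationship described above, the sketch below scores a cluster by the share of its compute cost that is not On-Demand. The function name, its parameters, and the linear mapping from On-Demand share to score are all assumptions made for this example; the actual Compute Copilot formula may weight the pricing models differently.

```python
def price_efficiency_score(spot_cost: float,
                           on_demand_cost: float,
                           commitments_cost: float) -> float:
    """Return a 0-100 score; a lower On-Demand share gives a higher score.

    Hypothetical helper, not the nOps implementation.
    """
    total_cost = spot_cost + on_demand_cost + commitments_cost
    if total_cost <= 0:
        # No compute spend, so there is nothing to score.
        return 0.0
    on_demand_share = on_demand_cost / total_cost
    # Assumption: the score falls linearly as the On-Demand share rises.
    return (1.0 - on_demand_share) * 100.0


if __name__ == "__main__":
    # $30 Spot, $50 On-Demand, $20 Savings Plans / Reserved Instances
    print(price_efficiency_score(30.0, 50.0, 20.0))
```

In this sketch, a cluster running entirely On-Demand scores 0, and a cluster fully covered by Spot and commitments scores 100.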
Breakdown of Metrics
- Estimated Opportunity: The total projected savings over the next 30 days based on potential optimizations.
- Price Efficiency Score: A higher score means a more cost-efficient usage of cloud resources, with a greater proportion of cost coming from Spot and Commitment-based pricing models rather than On-Demand instances.
- Spot Cost: Total cost incurred using Spot instances.
- On-Demand Cost: Total cost incurred using On-Demand instances.
- Commitments Cost: Total cost incurred through Reserved Instances or Savings Plans.
- Total Cost: Sum of all compute costs (Spot, On-Demand, and Commitments).
Overall Efficiency
The Overall Efficiency card in Compute Copilot provides a comprehensive view of an organization's resource optimization across multiple areas, including cost efficiency, node efficiency, and container efficiency. This card aggregates data from the Price Efficiency, Node Efficiency, and Container Efficiency metrics to offer a single score that reflects the overall effectiveness of resource utilization across these dimensions.
How Overall Efficiency is Calculated
The Overall Efficiency Score is calculated as the average of the individual scores from Price Efficiency, Node Efficiency, and Container Efficiency, with each metric contributing equally. The final score is expressed as a percentage from 0 to 100, with a higher score indicating more optimized resource usage across these areas.
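The equal-weight average can be expressed as a short Python sketch. The function name and the input validation are illustrative additions, not part of the product.

```python
def overall_efficiency_score(price: float, node: float, container: float) -> float:
    """Average three 0-100 efficiency scores with equal weight.

    Hypothetical helper written for illustration.
    """
    scores = (price, node, container)
    for score in scores:
        if not 0 <= score <= 100:
            raise ValueError(f"score out of range 0-100: {score}")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Price 90, Node 60, Container 75
    print(overall_efficiency_score(90, 60, 75))
```

Because the weights are equal, a low score in any one dimension lowers the overall score by one third of the shortfall.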
Breakdown of Metrics
- Estimated Total Opportunity: The total projected savings that could be achieved by optimizing cloud spending, node usage, and container resource allocation.
- Overall Efficiency Score: A composite score that provides a unified view of resource optimization, calculated by averaging the individual scores from Price Efficiency, Node Efficiency, and Container Efficiency.
Cost Breakdown
The Cost Breakdown chart on the EKS page provides a clear view of how costs are distributed across different components of your EKS clusters. This pie chart categorizes costs into several key areas, helping you understand where your budget is allocated. Here’s what each category represents:
- CPU: This section shows the costs associated with the compute instances used by your clusters. It includes expenses related to the number of virtual CPUs and the types of instances in use.
- GPU: This part highlights the costs for GPU resources utilized by your clusters. If your workloads use GPU instances, their associated costs are reflected here.
- Memory: This category covers the expenses related to the amount of RAM allocated to your clusters. It includes costs based on memory usage.
- Storage: This section represents the costs for various storage services, such as volumes and snapshots. It includes charges for storing and managing data within your clusters.
- Network: This part accounts for costs associated with data transfer, NAT gateways, and load balancers. It includes charges for network-related services and data movement.
- Extended Support: This category reflects the additional charges incurred when your EKS clusters run a Kubernetes version that is past its standard support period. AWS bills these clusters at a higher extended support rate until they are upgraded to a version in standard support.
- Control Plane: This section includes costs not classified under the above categories. It covers miscellaneous expenses related to the management and operation of your clusters.
Cluster Score Over Time
The Cluster Score Over Time chart tracks the efficiency scores of your clusters over the past 30 days. This visualization helps you monitor trends in cluster efficiency and identify areas for optimization.
Number of Clusters
The Number of Clusters card on the EKS page provides an overview of your clusters and their configuration status. This card helps you quickly assess how many clusters you have and their current setup. Below is a breakdown of what each configuration status means:
Configuration Statuses
- Fully Configured: Represents clusters where all necessary nOps agents and configurations are installed, ensuring full monitoring and management capabilities.
- Partially Configured: Clusters that have some but not all required configurations or agents installed. These clusters are partially set up for monitoring or management.
- Not Configured: Represents clusters that have no configurations or agents installed. These clusters are not yet set up for monitoring or management.
Number of CPUs
Next to each configuration status, you’ll also see the number of CPUs used by the clusters. The number of CPUs reflects the total processing power consumed by the clusters and is used for billing purposes.
Cost Over Time
The Cost Over Time Chart only considers clusters with the BC+ agent installed.
The Cost Over Time chart visualizes the cost trends for your EKS clusters with the BC+ agent installed. It provides a historical view of your cluster costs, allowing you to track how expenses change over time. This chart helps you identify cost patterns and make informed decisions to optimize your spending.
Cluster List Table
The table is updated by a background job that runs every 3 hours. Therefore, newly created clusters or updates to existing clusters (such as EKS version changes) may not appear immediately. You might need to wait up to 3 hours for these changes to be reflected in the table.
The Cluster List Table provides a comprehensive overview of your EKS clusters, displaying key information and metrics to help you manage and monitor your clusters efficiently. The table includes the following columns:
- Status: Provides the configuration status of each cluster, which can be Fully Configured, Partially Configured, or Not Configured.
- Type: Indicates the type of cluster, such as Cluster Autoscaler, Karpenter, or Undetermined.
- Cluster Name: The name assigned to each cluster. Clicking on the cluster name will open the cluster's dashboard for detailed insights.
- Cluster ID: The Amazon Resource Name (ARN) for the cluster, providing a unique identifier.
- EKS Version: Shows the version of Amazon EKS running on the cluster.
- Region: Displays the AWS region where the cluster is located.
- CPU: Represents the CPU utilization percentage for the cluster, shown as a blue line indicating the percentage. This metric is available only for clusters with the BC+ agent installed.
- Memory: Displays the memory utilization percentage, similarly shown with a blue line. Like CPU metrics, this is available only for clusters with the BC+ agent.
- Normalized 30 Days CPU: Represents the number of CPUs consumed by the cluster over the past 30 days.
- Container Efficiency Savings/Score: Displays the estimated savings opportunity and efficiency score for container resource utilization within the cluster. A higher score indicates a better balance between resource allocation and actual usage.
- Node Efficiency Savings/Score: Shows the estimated savings opportunity and efficiency score for node-level resource utilization. A higher score means the cluster is making better use of its allocated nodes with minimal waste.
- Price Efficiency Savings/Score: Represents the estimated cost savings and price efficiency score, calculated based on the usage of Spot, On-Demand, and Commitment-based instances. A higher score reflects more cost-efficient cloud resource consumption.
- 30 Days Score: Displays the Overall Efficiency Score over the past 30 days, which is an average of the Price Efficiency, Node Efficiency, and Container Efficiency scores.
- 30 Days Cost: Shows the total cost incurred by the cluster over the past 30 days.
- Action: Contains a button that opens a modal for configuring the cluster. This allows users to view and modify the cluster's settings.
How to Configure Your Cluster
To set up your EKS clusters, follow the guides below based on your preferred tool:
- Configure Compute Copilot for Karpenter: Instructions for setting up Compute Copilot with Karpenter.
- Configure Compute Copilot for Cluster Autoscaler: Instructions for setting up Compute Copilot with Cluster Autoscaler.
Before following these guides, ensure that Karpenter or Cluster Autoscaler is already configured on your cluster, as it is a prerequisite.