Compute Copilot Kubernetes Spark Application Rightsizing
Spark Applications on Kubernetes (Startup-Only Optimization)
Overview
Our Container Rightsizing engine now supports Apache Spark workloads that are deployed and managed through the Spark Operator on Kubernetes. Spark jobs are often short-lived and dynamic, which makes it difficult to track and optimize resource usage across multiple runs. This feature bridges that gap by recognizing recurring Spark applications and providing aggregated recommendations that take historical data into account.
Why Spark Operator Support Matters
When Spark runs on Kubernetes through the Spark Operator, each execution of a SparkApplication spawns a new set of driver and executor pods. These pods are ephemeral—created for the duration of the job and then destroyed. As a result, they typically appear in metrics systems as distinct workloads, even when they represent multiple executions of the same logical job.
By introducing Spark Operator–aware rightsizing, our platform can now treat these recurring workloads as a single logical entity. This provides stable, consistent resource recommendations over time, reducing waste and improving performance for repeat Spark jobs.
How Does It Work?
At a high level, the system identifies Spark workloads managed by the Spark Operator, groups multiple runs of the same SparkApplication, and aggregates their resource usage metrics. The rightsizing engine then produces recommendations based on this aggregated historical data rather than treating each run independently.
Conceptually, this involves:
- Recognizing driver and executor pods associated with Spark Operator–managed workloads.
- Grouping runs that belong to the same logical SparkApplication.
- Aggregating resource utilization metrics across those runs.
- Generating resource recommendations that persist and evolve over time.
No manual configuration is required for this grouping; the system automatically detects and manages Spark workloads where possible.
Identification and Grouping
Pods are identified as Spark-related if they carry labels applied by the Spark Operator.
Specifically, the labels used for this identification are spark_role and spark_app_selector.
If a pod carries either (or both) of these labels, it is treated as a SparkApplication-related pod and included in the aggregation.
Two types of pods are aggregated: Spark driver pods and executor pods.
The spark_role label is used to distinguish driver pods from executor pods.
Each type is aggregated separately, and both are attributed to the SparkApplication that "owns" them.
When a pod is identified as a SparkApplication-related pod, its controller type is reported as SparkApplication and its controller name is the name of the SparkApplication that manages it.
The mapping from driver/executor pods to their SparkApplication also relies on Kubernetes labels, using either the spark_app_name label or the app label.
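For illustration, a driver pod belonging to a SparkApplication named daily-etl might carry metadata along these lines (the pod name, namespace, and label values are hypothetical, and the label keys are shown in the underscore form used above; your Spark Operator version may emit them in a slightly different form):
apiVersion: v1
kind: Pod
metadata:
  name: daily-etl-driver
  namespace: spark-apps
  labels:
    spark_role: driver                 # marks this pod as the Spark driver
    spark_app_selector: spark-1a2b3c   # identifies this particular run
    spark_app_name: daily-etl          # used to attribute the pod to its SparkApplication
Executor pods for the same run carry spark_role: executor together with the same application labels, so both pod types roll up to the daily-etl SparkApplication.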
What it does
When enabled for a SparkApplication (Spark Operator, sparkoperator.k8s.io/v1beta2), nOps automatically rightsizes the driver and executor pods at startup.
Kubernetes CPU/memory requests and limits are applied during admission, and Spark executors are aligned to that sizing by setting runtime environment variables as follows (a worked example follows this list):
- SPARK_EXECUTOR_CORES = ceiling(recommended cores); discrete integer value, minimum 1
- SPARK_EXECUTOR_MEMORY = recommended memory in MB, with a small JVM headroom subtracted for small heaps (≤ 2 GiB → −300 MB); minimum 512 MB
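As a worked example (illustrative numbers only), a per-executor recommendation of 1.4 cores and 1.5 GiB of memory would translate to:
- SPARK_EXECUTOR_CORES = ceiling(1.4) = 2
- SPARK_EXECUTOR_MEMORY = 1536 MB − 300 MB = 1236 MB (the heap is ≤ 2 GiB, so the 300 MB headroom is subtracted; the result stays above the 512 MB floor)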
Safety model: Spark workloads always use startup-only updates (VPA updateMode: Initial). We do not resize running drivers/executors to avoid stage disruption. If --set vpa.updater.tryinplace=true is enabled cluster-wide, it is ignored for Spark.
Requirements
- Spark Operator installed (sparkoperator.k8s.io/v1beta2)
- nOps Kubernetes Agent v0.4.19+ installed and functional
- Container Rightsizing enabled (see How to Enable Container Right Sizing)
How to enable (UI)
- Open Compute Copilot → Container Rightsizing.
- Find your SparkApplication by name/namespace.
- Enable rightsizing and pick a Policy (Maximum Savings / High availability / Dynamic).

How to enable (IaC) — basic
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: my-spark-job
  namespace: spark-apps
  labels:
    nops-vpa/enabled: "true"
  annotations:
    nops-vpa/policy: "Dynamic" # or "Maximum Savings", "High availability"
spec:
  type: Scala
  mode: cluster
  image: "spark:3.4.0"
How to enable (IaC) — advanced (per-container & sidecars)
metadata:
  labels:
    nops-vpa/enabled: "true"
  annotations:
    nops-vpa/policy.spark-kubernetes-driver: "High availability"
    nops-vpa/policy.spark-kubernetes-executor: "Maximum Savings"
    nops-vpa/container.monitoring-sidecar: "disabled"
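These per-container annotations target containers by name: spark-kubernetes-driver and spark-kubernetes-executor are the default container names Spark uses for driver and executor pods, while monitoring-sidecar is a hypothetical sidecar name shown only to illustrate excluding a container from rightsizing. The metadata snippet above is merged into the same SparkApplication manifest shown in the basic example.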
FAQ
Why aren’t my Spark applications appearing in the dashboard?
Ensure that your nOps Kubernetes agent is up to date and working properly, and that your Spark jobs are submitted through the Spark Operator. Jobs launched outside the operator are not currently recognized as Spark workloads and will be treated as generic Kubernetes workloads. They are still eligible for automated rightsizing, but will not benefit from the Spark-specific data aggregation.
Why are SparkApplication recommendations only applied on startup?
SparkApplication resource recommendations are only applied on startup to prevent interruption of potentially critical and long-running data analytics jobs.
How long does it take for recommendations to appear?
For a new SparkApplication, it may take 1–2 days before the initial recommendations are populated. SparkApplication rightsizing is subject to the same limitations as container rightsizing in this regard.
Can I customize how workloads are grouped or identified?
Grouping logic is automatic and based on conventions established by the Spark Operator. Contact support if you need to override default workload identification or aggregation.
Does this feature affect non-Spark workloads?
No. The Spark-specific aggregation applies only to workloads managed by the Spark Operator. Other Kubernetes workloads continue to be analyzed independently for container rightsizing.
How far back is historical data considered?
By default, recommendations are based on the available metrics, up to a 30-day look-back period.
Are there any limitations?
- Only Spark jobs managed by the Kubernetes Spark Operator are supported.
- Historical aggregation depends on the availability of past metrics data.
- Custom Spark launchers that bypass the operator may not appear in the dashboard, or will appear as generic Kubernetes workloads.
For more details on container rightsizing in general, see Container Rightsizing Overview.