Add Cloud Native AI Scheduling Challenges Whitepaper by rajaskakodkar · Pull Request #2164 · cncf/toc

rajaskakodkar · 2026-05-15T16:51:15Z

Adds the Cloud Native AI Scheduling Challenges Whitepaper.

Signed-off-by: Rajas Kakodkar <rajaskakodkar16@gmail.com>

andreyvelich

Thanks for this effort @rajaskakodkar!
Overall, it looks great; I left a few thoughts.

andreyvelich · 2026-05-19T13:16:09Z

+* Data transformation: Normalizing, encoding categorical variables, feature scaling  
+* Data splitting: Dividing data into training, validation, and test sets
+
+From a scheduling perspective, data preparation is typically CPU and I/O intensive rather than GPU-intensive. That said, GPU-accelerated frameworks can significantly speed up large-scale data processing tasks such as filtering, joining, and aggregating datasets. Jobs are often parallelizable—you can clean different partitions of a dataset independently. Event-driven scheduling is common: new data arriving triggers a preparation pipeline.


Maybe we can also mention unstructured data?

Suggested change

From a scheduling perspective, data preparation is typically CPU and I/O intensive rather than GPU-intensive. That said, GPU-accelerated frameworks can significantly speed up large-scale data processing tasks such as filtering, joining, and aggregating datasets. Jobs are often parallelizable—you can clean different partitions of a dataset independently. Event-driven scheduling is common: new data arriving triggers a preparation pipeline.

From a scheduling perspective, data preparation is typically CPU and I/O intensive rather than GPU-intensive. That said, GPU-accelerated frameworks can significantly speed up large-scale data processing tasks such as filtering, joining, and aggregating datasets. Additionally, GPUs work well for unstructured data like images because image processing involves massive parallel math operations.

Jobs are often parallelizable—you can clean different partitions of a dataset independently. Event-driven scheduling is common: new data arriving triggers a preparation pipeline.

This is a great suggestion @andreyvelich

andreyvelich · 2026-05-19T13:17:09Z

+
+From a scheduling perspective, data preparation is typically CPU and I/O intensive rather than GPU-intensive. That said, GPU-accelerated frameworks can significantly speed up large-scale data processing tasks such as filtering, joining, and aggregating datasets. Jobs are often parallelizable—you can clean different partitions of a dataset independently. Event-driven scheduling is common: new data arriving triggers a preparation pipeline.
+
+Kubernetes resources like Jobs and CronJobs handle these workloads reasonably well. Workflow orchestrators (Airflow, Argo Workflows, Flyte) coordinate multi-step pipelines.


What about Spark here?

Suggested change

Kubernetes resources like Jobs and CronJobs handle these workloads reasonably well. Workflow orchestrators (Airflow, Argo Workflows, Flyte) coordinate multi-step pipelines.

Kubernetes resources like Jobs, CronJobs, and SparkApplications handle these workloads reasonably well. Workflow orchestrators (Airflow, Argo Workflows, Flyte) coordinate multi-step pipelines.

"Like" marks these as examples; there are many resources we could use.

I kinda agree with mentioning only Jobs and CronJobs, as they are part of plain Kubernetes. Maybe SparkApplications can be mentioned later?

andreyvelich · 2026-05-19T13:21:19Z

+Model development has two distinct activities that are often combined:
+
+* **Feature engineering** transforms prepared data into input features the model can use. This involves creating new variables, encoding categorical data, and selecting which features to include. Feature engineering is computationally similar to data preparation—CPU and I/O bound, parallelizable, often triggered by new data.  
+* **Model architecture** involves selecting the type of model (linear regression, decision tree, neural network, transformer) and designing its structure. For deep learning, this means defining layers, attention mechanisms, and other architectural choices. This work is often interactive—a data scientist experimenting in a notebook—and does not require significant compute resources until training begins.


Would it make sense to add a topic around hyperparameter optimization (HPO)?

Suggested change

* **Model architecture** involves selecting the type of model (linear regression, decision tree, neural network, transformer) and designing its structure. For deep learning, this means defining layers, attention mechanisms, and other architectural choices. This work is often interactive—a data scientist experimenting in a notebook—and does not require significant compute resources until training begins.

* **Model architecture** involves selecting the type of model (linear regression, decision tree, neural network, transformer) and designing its structure. For deep learning, this means defining layers, attention mechanisms, and other architectural choices. This work is often interactive—a data scientist experimenting in a notebook—and does not require significant compute resources until training begins.

* **Hyperparameter tuning** optimizes how the model learns rather than the structure of the model itself. This includes adjusting parameters such as learning rate, batch size, optimizer choice, number of epochs, and dropout rates. Unlike architecture design, hyperparameter tuning is compute-intensive because it requires repeatedly training and evaluating many model variants. These tuning jobs are highly parallelizable and are commonly distributed across GPUs or clusters.

@andreyvelich please suggest a change for the paragraph below since it is no longer necessarily valid re: heavy resource demands only in the next stage. :)
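To illustrate why the hyperparameter tuning described in the suggested bullet above parallelizes so well, here is a minimal Python sketch. It is not taken from the whitepaper: `train_and_evaluate` is a hypothetical stand-in that returns a made-up loss, and a thread pool stands in for what would be independent jobs on a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def train_and_evaluate(learning_rate: float, batch_size: int) -> float:
    """Hypothetical stand-in for a training run; returns a made-up validation loss."""
    return abs(learning_rate - 0.01) * 100 + abs(batch_size - 64) / 64


def run_trials(learning_rates, batch_sizes, max_parallel=4):
    """Evaluate every (learning_rate, batch_size) pair and return the best one."""
    trials = list(product(learning_rates, batch_sizes))
    # Trials share no state, so they can run side by side.
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        losses = list(pool.map(lambda cfg: train_and_evaluate(*cfg), trials))
    best_loss, best_cfg = min(zip(losses, trials))
    return best_cfg, best_loss


best_cfg, best_loss = run_trials([0.1, 0.01, 0.001], [32, 64, 128])
```

Because no trial depends on another, a scheduler can place each one independently, which is a different shape of demand from a gang-scheduled training job.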

andreyvelich · 2026-05-19T17:38:18Z

+* Tightly coupled: All workers must run simultaneously  
+* Sensitive to topology: Communication speed depends on GPU interconnects
+
+The default Kubernetes scheduler cannot handle these requirements. It will start pods as resources become available, potentially leaving a job stuck with partial resources indefinitely.


With the efforts around Workload Aware Scheduling (WAS), I wouldn't mention this. Maybe we can say:
cc @helayoty @kannon92 @mm4tt

Suggested change

The default Kubernetes scheduler cannot handle these requirements. It will start pods as resources become available, potentially leaving a job stuck with partial resources indefinitely.

These characteristics require additional Kubernetes scheduler capabilities to support efficient all-or-nothing placement and topology-aware scheduling.

Yeah, good callout.
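The all-or-nothing placement mentioned in the suggested wording above can be illustrated with a small Python sketch. This is an assumption-laden toy, not how any Kubernetes scheduler is implemented: `gang_place`, the node names, and the per-node free-GPU map are all hypothetical, and the first-fit loop ignores topology entirely.

```python
from typing import Optional


def gang_place(
    free_gpus: dict[str, int], workers: int, gpus_per_worker: int
) -> Optional[list[str]]:
    """Return one node name per worker, or None if the whole gang cannot fit."""
    remaining = dict(free_gpus)  # work on a copy so a failed attempt reserves nothing
    placement: list[str] = []
    for _ in range(workers):
        # First-fit: take the first node that still has room for one worker.
        node = next(
            (name for name, free in remaining.items() if free >= gpus_per_worker),
            None,
        )
        if node is None:
            return None  # a partial allocation is useless, so place nothing
        remaining[node] -= gpus_per_worker
        placement.append(node)
    free_gpus.update(remaining)  # commit only after every worker has a slot
    return placement


nodes = {"node-a": 4, "node-b": 2}
too_big = gang_place(nodes, workers=4, gpus_per_worker=2)
fits = gang_place(nodes, workers=3, gpus_per_worker=2)
```

The point of the copy-then-commit structure is that a job that cannot be fully placed leaves the cluster state untouched instead of holding partial resources.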

andreyvelich · 2026-05-19T17:47:20Z

+
+* **Long-running jobs.** A training run is not a request that completes in milliseconds. It is a job that runs for days or weeks. Interrupting it wastes all the work done since the last checkpoint. The scheduler must account for job duration, not just instantaneous resource needs.  
+* **Massive resource consumption.** Training large models requires hundreds or thousands of GPUs running simultaneously. A single job can consume the majority of a cluster's capacity for extended periods. This is not "scale horizontally by adding pods"—it is "reserve a large fraction of the cluster for one workload."  
+* **Tightly coupled distribution.** Distributed training uses collective communication patterns where all workers must participate. You cannot start with 7 of 8 workers and add the 8th later. You cannot lose one worker and continue with the remaining 7\. Either all workers are running, or the job cannot proceed. This is fundamentally different from web services, where losing one replica just shifts load to the others.  


Suggested change

* **Tightly coupled distribution.** Distributed training uses collective communication patterns where all workers must participate. You cannot start with 7 of 8 workers and add the 8th later. You cannot lose one worker and continue with the remaining 7\. Either all workers are running, or the job cannot proceed. This is fundamentally different from web services, where losing one replica just shifts load to the others.

* **Tightly coupled distribution.** Distributed training uses collective communication patterns where all workers must participate. You cannot start with 7 of 8 workers and add the 8th later. You cannot lose one worker and continue with the remaining 7. Either all workers are running, or the job cannot proceed. This is fundamentally different from web services, where losing one replica just shifts load to the others.

andreyvelich · 2026-05-19T19:50:24Z

+
+## ML Platform Tools
+
+These tools provide higher-level abstractions for ML workflows:


Suggested change

These tools provide higher-level abstractions for ML workflows:

These tools provide higher-level abstractions for AI workloads:

andreyvelich · 2026-05-19T19:51:02Z

+These tools provide higher-level abstractions for ML workflows:
+
+* **Kubeflow**  
+  * **Kubeflow Trainer** supports distributed training across frameworks (PyTorch, TensorFlow, PaddlePaddle, XGBoost). Provides job abstractions that handle worker coordination, including gang scheduling requirements.  


Ref: https://github.com/kubeflow/trainer#overview

Suggested change

* **Kubeflow Trainer** supports distributed training across frameworks (PyTorch, TensorFlow, PaddlePaddle, XGBoost). Provides job abstractions that handle worker coordination, including gang scheduling requirements.

* **Kubeflow Trainer** is a Kubernetes-native distributed AI platform for scalable LLM fine-tuning and training of AI models across a wide range of frameworks, including PyTorch, MLX, HuggingFace, DeepSpeed, JAX, XGBoost, and more. Provides job abstractions that handle worker coordination, including gang scheduling requirements and HPC workloads orchestration such as MPI and Flux.

andreyvelich · 2026-05-19T19:56:42Z

+| GPU Sharing | DRA (GA, K8s 1.34+) | KAI | HAMi, KubeRay, Volcano | Both | MIG requires DRA or vendor tools |
+| Scalability | Cluster Autoscaler, Karpenter | Armada, KAI, Kueue, Slinky, Volcano | interLink | Both | Large-scale scheduling is challenging |
+| I/O Bottlenecks | PersistentVolumes | \- | Fluid | Both | Storage and caching solutions |
+| Fault Tolerance | \- | Slinky,  | Kubeflow (elastic training) | Training | Framework-dependent |


andreyvelich · 2026-05-19T20:02:44Z

+| Preemption | PriorityClass (pod-level) | KAI, Kueue, Slinky, Volcano | \- | Both | Job-level preemption needs external tools |
+| Priority Scheduling | PriorityClass | All batch schedulers | \- | Both | Job-level priority in batch schedulers |
+| Reservation & Backfill | \- | Slinky, Volcano, YuniKorn | \- | Training | Advanced feature in some schedulers |
+| Topology Awareness (Node) | Topology Manager (NUMA), DRA CPU Driver (CPU topology) | KAI, Kueue, Slinky, Volcano | \- | Both | GPU interconnect awareness varies |
+| Topology Awareness (Cluster) | Topology Spread Constraints, DRANET (network DRA Driver) (limited) | KAI, Kueue, Slinky, Volcano | \- | Both | Network topology awareness is emerging |


andreyvelich · 2026-05-19T20:03:27Z

+**For ML engineers working with existing infrastructure:**
+
+1. Understand what scheduling tools are available in your cluster.  
+2. Use the appropriate job abstractions (PyTorchJob, MPIJob, etc.) rather than raw pods.  


Suggested change

2. Use the appropriate job abstractions (PyTorchJob, MPIJob, etc.) rather than raw pods.

2. Use the appropriate job abstractions (TrainJob, MPIJob, etc.) rather than raw pods.

I suggest also keeping PyTorchJob for legacy environments.

`processLabelRule` previously computed `shouldApply = !foundNamespace` unconditionally, so a `kind: label` rule with `matchCondition: AND` behaved identically to a `NOT` rule. Paired NOT/AND rules (e.g. apply `needs-triage` when no `triage/*` exists, remove it when one does) ended up firing in exactly the wrong situations: on a fresh PR with no labels the labeler would add `needs-triage`/`needs-kind`/`needs-group` and immediately remove them in the same run, and when a `triage/*` label was later added manually via the UI the paired `needs-*` label would never be cleared.

Also teach the label-rule `match` parser to understand a single level of comma-separated brace alternation such as `{toc,tag/*,sub/*}`. `filepath.Match` on its own does not support braces, so previously such a pattern only matched a literal label whose name began with `{`.

Adds focused tests for both behaviors, including a regression test that mirrors the cncf/toc#2164 scenario.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: Riaan Kleinhans <riaankleinhans@gmail.com>
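The two labeler behaviors described in the commit message above can be sketched in Python. This is only an illustration of the described logic, not the labeler's actual Go code: `expand_braces` and `rule_fires` are hypothetical names, and Python's `fnmatchcase` is used as a stand-in for Go's `filepath.Match` (one known difference is that `fnmatchcase` lets `*` match `/`, while `filepath.Match` does not).

```python
from fnmatch import fnmatchcase


def expand_braces(pattern: str) -> list[str]:
    """Expand one level of comma-separated brace alternation.

    Patterns without a complete {...} group are returned unchanged.
    """
    start = pattern.find("{")
    end = pattern.find("}", start + 1)
    if start == -1 or end == -1:
        return [pattern]
    prefix, body, suffix = pattern[:start], pattern[start + 1:end], pattern[end + 1:]
    return [prefix + alternative + suffix for alternative in body.split(",")]


def rule_fires(match_condition: str, pattern: str, labels: list[str]) -> bool:
    """NOT fires when no label matches the pattern; AND fires when one does."""
    globs = expand_braces(pattern)
    found = any(fnmatchcase(label, glob) for label in labels for glob in globs)
    if match_condition == "NOT":
        return not found
    if match_condition == "AND":
        return found
    raise ValueError(f"unsupported matchCondition: {match_condition}")


# Fresh PR with no labels: the NOT rule fires, the paired AND rule does not.
fresh_not = rule_fires("NOT", "triage/*", [])
fresh_and = rule_fires("AND", "triage/*", [])
```

Under this reading, the bug in the commit message corresponds to both branches returning `not found`, which makes a paired add/remove rule fire together on an unlabeled PR.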

angellk · 2026-06-08T20:58:15Z

+* Data transformation: Normalizing, encoding categorical variables, feature scaling  
+* Data splitting: Dividing data into training, validation, and test sets
+
+From a scheduling perspective, data preparation is typically CPU and I/O intensive rather than GPU-intensive. That said, GPU-accelerated frameworks can significantly speed up large-scale data processing tasks such as filtering, joining, and aggregating datasets. Jobs are often parallelizable—you can clean different partitions of a dataset independently. Event-driven scheduling is common: new data arriving triggers a preparation pipeline.


This is a great suggestion @andreyvelich

angellk · 2026-06-08T21:01:27Z

+Model development has two distinct activities that are often combined:
+
+* **Feature engineering** transforms prepared data into input features the model can use. This involves creating new variables, encoding categorical data, and selecting which features to include. Feature engineering is computationally similar to data preparation—CPU and I/O bound, parallelizable, often triggered by new data.  
+* **Model architecture** involves selecting the type of model (linear regression, decision tree, neural network, transformer) and designing its structure. For deep learning, this means defining layers, attention mechanisms, and other architectural choices. This work is often interactive—a data scientist experimenting in a notebook—and does not require significant compute resources until training begins.


@andreyvelich please suggest a change for the paragraph below since it is no longer necessarily valid re: heavy resource demands only in the next stage. :)

angellk · 2026-06-08T21:27:44Z

+
+| Stage | Primary Resources | Duration | Scheduling Characteristics |
+| :---- | :---- | :---- | :---- |
+| Data Preparation | CPU, storage I/O, network | Minutes to hours | Parallelizable, event-driven, no gang requirement |


It's helpful to denote that there is no gang requirement. Could you please rephrase your suggestion, @andreyvelich?

angellk · 2026-06-08T21:48:54Z

+**For ML engineers working with existing infrastructure:**
+
+1. Understand what scheduling tools are available in your cluster.  
+2. Use the appropriate job abstractions (PyTorchJob, MPIJob, etc.) rather than raw pods.  


I suggest also keeping PyTorchJob for legacy environments.

angellk · 2026-06-08T21:51:11Z

+5. The ecosystem is maturing. Projects like Kueue, Slinky, Volcano, and KAI Scheduler are production-ready. Dynamic Resource Allocation is reaching stability. The tools exist—the question is choosing and integrating them.  
+6. Start simple, iterate. You don't need every feature on day one. Start with basic queuing and quotas, and add capabilities as your needs grow.
+
+The cloud-native AI landscape continues to evolve. New hardware (different GPU architectures, accelerators), new workload patterns (agentic systems, distributed inference), and new Kubernetes features will create both challenges and opportunities. This series provides a foundation; staying current requires ongoing engagement with the community.


This series provides a foundation; staying current requires ongoing engagement with the community.

love this ❤️

salaboy · 2026-06-11T13:21:50Z

+
+## The AI/ML Lifecycle
+
+AI and machine learning projects follow a lifecycle with distinct stages. Each stage has different resource requirements and scheduling characteristics.


I would add a bulleted list here of the sections that follow, so the reader knows what is coming and how many stages to expect:

* Data Preparation
* Model Development
* Model Training
* Model Inference
* Emerging Patterns

These can be links to the following sections.

salaboy · 2026-06-11T13:32:28Z

+* **Real-time inference** responds to individual requests with low latency. A user sends a query; the model returns a prediction in milliseconds. Real-time inference requires:  
+  * Models preloaded in GPU memory (loading a model can take longer than serving a request)  
+  * Horizontal scaling to handle variable request rates  
+  * Low-latency networking


While low-latency networking is always relevant, I don't think it is related to scheduling, or is it? I wonder if it is mentioned here because a later point refers back to it.

salaboy · 2026-06-11T13:38:11Z

+| Batch Inference | Variable GPU count | Hours to days | Parallelizable, throughput-oriented |
+| Real-time Inference | GPUs with models preloaded | Continuous | Low latency, autoscaling, model serving |
+
+The key insight: different stages need different scheduling strategies. A cluster running the full ML lifecycle must handle event-driven pipelines, interactive notebooks, gang-scheduled training, and latency-sensitive inference—often simultaneously, competing for the same GPU resources.


I am not an expert on this topic, but does it make sense to have a single cluster for all these tasks? Wouldn't it be more practical to have specialized clusters?

salaboy · 2026-06-11T13:46:04Z

+
+### **Traditional HPC Schedulers: Task-Level Scheduling**
+
+In traditional high-performance computing (HPC) environments, schedulers such as Slurm operate at the task level (also called a rank or process group member).


Do we need a link for Slurm? Are there other schedulers in that space? Is Slurm the most widely used one? Asking for context.

salaboy · 2026-06-11T13:48:40Z

+
+This model naturally supports gang scheduling, topology-aware placement, and reservation-based execution. These capabilities are particularly well-suited for model training workloads.
+
+In HPC-style schedulers, the scheduling unit is a task within a job, with all tasks scheduled together.


I would love to see a diagram here to make the relationship between concepts such as tasks and jobs easier to understand for people who haven't used these systems before.

salaboy · 2026-06-11T13:54:16Z

+* New data arrives in a storage bucket → trigger a data preparation pipeline  
+* A model training job completes → trigger an evaluation job  
+* An upstream job fails → trigger a notification or retry
+


Having a diagram here might also help solidify how the concepts connect to each other.

salaboy · 2026-06-11T14:24:47Z

+  * **Distributed training:** Workers use collective communication (all-reduce) that requires every participant. A partial allocation is useless.  
+  * **Multi-pod inference:** Model-parallel deployments and disaggregated serving architectures require all components (e.g., prefill and decode workers) to be running before the system can serve requests.  
+  * **Distributed data preparation:** Parallel jobs that must complete together to produce consistent output benefit from all-or-nothing scheduling.  
+* **Current state:** Kubernetes-native batch schedulers that support gang scheduling include the coscheduling plugin (via PodGroups), Armada, KAI Scheduler, and Volcano. Native gang scheduling with the Workloads API h[as been implemented in Kubernetes 1.35 as an alpha feature](https://kubernetes.io/docs/concepts/workloads/workload-api/), with a goal to reach beta in Kubernetes 1.36.


Replace `h[as` with `[has`.

salaboy · 2026-06-11T14:31:59Z

+# Job Orchestration Challenges
+
+This section outlines the scheduling challenges related to job orchestration—how jobs are admitted, ordered, and coordinated.
+


Here I would appreciate a list of the topics covered later in this section, to give the reader a mental map.

We are going to cover:

* Gang Scheduling
* Resource Fairness and Quota Management
* Queue Management
* Preemption
* Priority Scheduling
* Resource Reservation and Backfill

salaboy · 2026-06-11T14:32:32Z

+  * Reservation locks resources for a specific job. The scheduler identifies which resources will be needed and stops scheduling new work to them, even if they're currently idle.  
+  * Backfill allows small, short jobs to use reserved resources temporarily, as long as they'll finish before the reserved job needs them.  
+* **Mechanics:** The scheduler estimates when reserved resources will be free (based on running jobs' expected completion), then allows backfill jobs that fit within that window. This requires jobs to declare (or the system to estimate) their expected runtime.
+


It feels to me that a summary is needed here before jumping to "What's Next".
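The backfill mechanics quoted in the diff above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the algorithm of any particular scheduler: the function names are hypothetical, times are plain seconds on a shared clock, and declared runtimes are trusted as accurate.

```python
def reservation_start(expected_end_times: list[float], now: float) -> float:
    """Estimate when reserved resources are all free.

    The reserved job can start once the last job still running on its
    resources is expected to finish (or immediately if nothing is running).
    """
    return max(expected_end_times, default=now)


def can_backfill(now: float, declared_runtime: float, window_end: float) -> bool:
    """A backfill job is admitted only if it finishes before the reservation begins."""
    return now + declared_runtime <= window_end


now = 100.0
window_end = reservation_start([120.0, 300.0], now)
short_job_ok = can_backfill(now, declared_runtime=150.0, window_end=window_end)
long_job_ok = can_backfill(now, declared_runtime=250.0, window_end=window_end)
```

The sketch also shows why declared or estimated runtimes matter: without `declared_runtime`, the scheduler has no way to decide whether a job fits in the window.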

salaboy · 2026-06-11T14:40:47Z

+# Resource and Infrastructure Challenges
+
+This section outlines scheduling challenges related to hardware resources and infrastructure—where and how jobs run. While many examples reference training workloads, these challenges apply equally to multi-node inference deployments, such as model-parallel or disaggregated serving architectures.
+


A list of the upcoming sections would be highly appreciated here.

salaboy · 2026-06-11T14:41:44Z

+* **Failure domains.** Placing all workers in the same rack minimizes network hops but means a rack failure kills the entire job. Spreading workers across racks improves resilience but increases communication latency.
+
+The scheduler must balance these concerns. For latency-sensitive training, co-location may be worth the reduced resilience. For long-running jobs, spreading across failure domains and accepting higher latency may be preferable to risking a full restart.
+


A summary of what was covered here, with the main takeaways, would help people who read to the end leave with a high-level view of the paper.

salaboy · 2026-06-11T14:54:26Z

Folks, congratulations, these papers are great reads. I've added some comments to improve the reader experience, but besides that this looks awesome.

I've noticed that some terms, like all-reduce, have no references. For readers who are not ML/AI engineers, those terms might need concrete references, as they are mentioned in several places.

I can't wait for this to get published.

kfaseela · 2026-06-11T14:56:22Z

+  * Role-aware scheduling: the scheduler must understand pod roles (e.g., master vs. worker) and preempt workers before masters to avoid job failure  
+* **Handling failures without full restart:** For gang-scheduled jobs, one worker failure typically crashes the entire job. Elastic training relaxes this—the job continues with the surviving workers, and a replacement worker can join later.
+
+## Budget and Cost Constraints


The budget constraints section covers financial cost (GPU-hours, cloud spend) well. Should power/energy budget also be considered here as a parallel infrastructure constraint?

kfaseela · 2026-06-11T14:59:24Z

+
+# What's Next
+
+This paper examined the resource and infrastructure challenges for AI workloads. The final paper in this series, **Solutions and Practical Guidance for AI Workload Scheduling**, catalogs the tools and Kubernetes features that address these challenges, provides a reference table mapping challenges to solutions, and offers practical guidance including real-world use cases.


Should power consumption be included as a resource and infrastructure challenge?

kfaseela · 2026-06-11T15:09:47Z

+| Topology Awareness (Cluster) | Topology Spread Constraints, DRANET (network DRA Driver) (limited) | KAI, Kueue, Slinky, Volcano | \- | Both | Network topology awareness is emerging |
+| Resource Heterogeneity | Node selectors, labels | All batch schedulers | \- | Both | Standard Kubernetes features usually sufficient |
+| GPU Sharing | DRA (GA, K8s 1.34+) | KAI | HAMi, KubeRay, Volcano | Both | MIG requires DRA or vendor tools |
+| Scalability | Cluster Autoscaler, Karpenter | Armada, KAI, Kueue, Slinky, Volcano | interLink | Both | Large-scale scheduling is challenging |


The Scalability row/inference autoscaling requirement in the solutions table already covers good tools. Should KEDA be included somewhere? Ref: #2188 and https://www.cncf.io/blog/2026/05/27/gpu-autoscaling-on-kubernetes-with-keda-building-an-external-scaler/. Just a question, because we started with scheduling, but the table covers scaling as well.

Add Cloud Native AI Scheduling Challenges Whitepaper

a84e737

Signed-off-by: Rajas Kakodkar <rajaskakodkar16@gmail.com>

rajaskakodkar requested review from a team as code owners May 15, 2026 16:51

andreyvelich reviewed May 19, 2026

riaankleinhans mentioned this pull request May 22, 2026

fix(labeler): honor matchCondition: AND and support brace patterns (my pick — matches the commit, scoped, descriptive) cncf/automation#439

Merged

brandtkeller linked an issue May 29, 2026 that may be closed by this pull request

[Initiative]: Cloud Native AI Scheduling Challenges Whitepaper #1641

Open

angellk requested review from angellk and salaboy June 2, 2026 15:49

angellk requested changes Jun 8, 2026

mrbobbytables requested review from chira001 and raravena80 June 9, 2026 15:27

salaboy reviewed Jun 11, 2026

kfaseela reviewed Jun 11, 2026

Suggested change

| Fault Tolerance | \- | Slinky,  | Kubeflow (elastic training) | Training | Framework-dependent |

| Fault Tolerance | \- | Slinky,  | Kubeflow Trainer | Training | Framework-dependent |
