Data science jobs requiring SLURM

Why SLURM Jobs Are in High Demand in 2026

SLURM (originally an acronym for Simple Linux Utility for Resource Management) is one of the most widely used workload managers and job schedulers for high-performance computing (HPC) clusters. In 2026 it is a critical skill for ML engineers who train large models on on-premises GPU clusters and HPC infrastructure at national labs, research universities, pharmaceutical companies, and financial institutions. As organizations invest heavily in private GPU infrastructure for AI workloads, SLURM expertise is essential for sharing these expensive compute resources efficiently across multiple teams and projects.

ML engineers working with SLURM write job scripts that specify GPU allocation (the number of GPUs and, where the cluster defines them, the GPU type), host memory, CPU cores, wall-clock time limits, and partitions (queues). SLURM's array jobs enable embarrassingly parallel hyperparameter sweeps in which hundreds of training configurations are submitted as a single job and run concurrently as cluster nodes become available. Integrating PyTorch Distributed or Ray for multi-node distributed training requires understanding SLURM's node allocation and inter-node networking, particularly InfiniBand for high-bandwidth GPU-to-GPU communication via NCCL.
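As a rough illustration of the job-script and array-job workflow described above, the following Python sketch renders a batch script for a small hyperparameter sweep and shows how each array element could pick its own configuration. The partition name `gpu`, the resource values, the `logs/` directory, and the `train.py` entry point are hypothetical placeholders rather than details of any particular cluster. The `#SBATCH` options used (`--partition`, `--gres`, `--cpus-per-task`, `--mem`, `--time`, `--array`, `--output`) and the `SLURM_ARRAY_TASK_ID` environment variable are standard SLURM features.

```python
import itertools
import os


def build_sweep_grid(learning_rates, batch_sizes):
    """Return every (lr, batch_size) combination as a list of dicts."""
    return [
        {"lr": lr, "batch_size": bs}
        for lr, bs in itertools.product(learning_rates, batch_sizes)
    ]


def render_array_script(n_tasks, gpus=1, cpus=8, mem="64G",
                        time_limit="04:00:00", partition="gpu",
                        max_concurrent=20):
    """Render an sbatch script for a job array with n_tasks elements.

    All default values are illustrative assumptions; real clusters
    define their own partition names and resource limits.
    """
    lines = [
        "#!/bin/bash",
        "#SBATCH --job-name=hp-sweep",
        f"#SBATCH --partition={partition}",
        f"#SBATCH --gres=gpu:{gpus}",            # GPUs per array element
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem}",                  # host memory, not GPU memory
        f"#SBATCH --time={time_limit}",          # wall-clock limit
        # Indices 0..n_tasks-1, at most max_concurrent running at once.
        f"#SBATCH --array=0-{n_tasks - 1}%{max_concurrent}",
        "#SBATCH --output=logs/%A_%a.out",       # %A = job ID, %a = array index
        "",
        "srun python train.py",                  # hypothetical training script
    ]
    return "\n".join(lines) + "\n"


def config_for_this_task(grid, env=os.environ):
    """Pick the configuration for the current array element."""
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    return grid[task_id]


grid = build_sweep_grid([1e-4, 3e-4, 1e-3], [32, 64])
script = render_array_script(len(grid))
```

In this sketch, `script` would be written to a file and submitted with `sbatch`, while the training script would call something like `config_for_this_task` at startup so that each array element trains a different configuration.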

Platform engineers running ML training infrastructure need SLURM administration skills: configuring partitions, setting resource limits per user or group, managing GPU accounting, configuring preemption policies, and optimizing cluster utilization. Experiment-tracking tools like MLflow and Weights & Biases can be called from within SLURM-managed jobs to log each run. As cloud GPU costs remain high, many research organizations and enterprises with steady-state ML workloads prefer SLURM-managed on-premises GPU clusters, making SLURM expertise a specialized and well-compensated skill in the ML infrastructure space.
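To make the GPU accounting task above more concrete, here is a minimal Python sketch that totals GPU-hours per user from pipe-delimited accounting output. It assumes the text came from something like `sacct -a -X -P --format=JobID,User,AllocTRES,ElapsedRaw`, and that GPUs appear in the `AllocTRES` field as a `gres/gpu=N` entry, which depends on how the cluster tracks TRES. The sample rows and user names are invented for illustration.

```python
from collections import defaultdict


def gpus_in_tres(alloc_tres):
    """Extract the generic GPU count from an AllocTRES string.

    Only the untyped "gres/gpu" entry is counted, so typed entries
    such as "gres/gpu:a100" are not double counted.
    """
    for item in alloc_tres.split(","):
        key, _, value = item.partition("=")
        if key == "gres/gpu":
            return int(value)
    return 0


def gpu_hours_by_user(sacct_text):
    """Sum allocated GPU-hours per user from pipe-delimited sacct text."""
    lines = [ln for ln in sacct_text.strip().splitlines() if ln]
    header = lines[0].split("|")
    totals = defaultdict(float)
    for ln in lines[1:]:
        row = dict(zip(header, ln.split("|")))
        if not row["User"]:
            continue  # skip rows without a user, such as job steps
        gpus = gpus_in_tres(row["AllocTRES"])
        # ElapsedRaw is the job's elapsed time in seconds.
        totals[row["User"]] += gpus * int(row["ElapsedRaw"]) / 3600
    return dict(totals)


# Invented sample data in the assumed format.
SAMPLE = """JobID|User|AllocTRES|ElapsedRaw
1001|alice|billing=8,cpu=8,gres/gpu=2,mem=64G,node=1|7200
1002|bob|billing=4,cpu=4,mem=16G,node=1|3600
1003|alice|billing=16,cpu=16,gres/gpu=4,mem=128G,node=1|1800
"""

report = gpu_hours_by_user(SAMPLE)
```

A report like this measures allocated GPU time rather than actual GPU utilization, so it is better suited to chargeback or fair-share reviews than to spotting idle GPUs.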

TransUnion Chicago, Illinois, United States

Machine Learning Platform Engineer

Full-time

C/C++ Dash Java MLflow Python +6 more

$90,000 - $150,000
