slurm_cluster_setup
- acme.slurm_cluster_setup(partition: str = 'partition_name', n_cores: int = 1, n_workers: int = 1, processes_per_worker: int = 1, mem_per_worker: str | None = '1GB', n_workers_startup: int = 1, timeout: int = 60, interactive: bool = True, interactive_wait: int = 10, start_client: bool = True, job_extra: List = [], worker_extra_args: List[str] | None = None, scheduler_options: Dict | None = None, avail_partitions: List = [], invalid_partitions: List = [], mem_cushion: int = 100, **kwargs: Any | None) → Client | SLURMCluster | None
Start a distributed Dask cluster of parallel processing workers using SLURM
NOTE If you are working on the ESI or CoBIC HPC cluster, please use esi_cluster_setup() or bic_cluster_setup() instead!
- Parameters:
partition (str) – Name of SLURM partition/queue to use
n_cores (int) – Number of CPU cores per SLURM worker
n_workers (int) – Number of SLURM workers (=jobs) to spawn
processes_per_worker (int) – Number of processes to use per SLURM job (=worker). Should be greater than one only if the chosen partition contains nodes that expose multiple cores per job.
mem_per_worker (str or None) – Memory allocation for each worker. If None, the partition’s DefMemPerCPU is queried.
n_workers_startup (int) – Number of spawned SLURM workers to wait for. The code does not return until either n_workers_startup SLURM jobs are running or the timeout interval (see below) has been exceeded.
timeout (int) – Number of seconds to wait for requested workers to start (see n_workers_startup).
interactive (bool) – If True, user input is queried in case not enough workers (set by n_workers_startup) could be started in the provided waiting period (determined by timeout). The code waits interactive_wait seconds for a user choice - if none is provided, it continues with the current number of running workers (if greater than zero). If interactive is False and no worker could be started within timeout seconds, a TimeoutError is raised.
interactive_wait (int) – Countdown interval (seconds) to wait for a user response in case fewer than n_workers_startup workers could be started. If no choice is provided within the given time, the code automatically proceeds with the current number of active dask workers.
start_client (bool) – If True, a distributed computing client is launched and attached to the dask worker cluster. If start_client is False, only a distributed computing cluster is started to which compute-clients can connect.
job_extra (list) – Extra sbatch parameters to pass to SLURMCluster.
worker_extra_args (list or None) – Additional arguments to be passed to distributed.Worker
scheduler_options (dict or None) – Additional arguments to be passed to distributed.Scheduler
avail_partitions (list) – List of valid partition names (strings) that are available for launching dask workers. If not provided, partitions are fetched at runtime using sinfo.
invalid_partitions (list) – List of partition names (strings) that are not available for launching dask workers.
mem_cushion (int) – Amount of memory to “withhold” from mem_per_worker to stay clear of partition limits (either imposed via QoS or MaxMemPerCPU)
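The parameters above might be combined as in the following minimal sketch. It assumes that ACME is installed and that a SLURM scheduler is reachable from the calling machine. The partition name "my_partition" and all numeric values are placeholders chosen for illustration, and start_slurm_cluster is a hypothetical helper, not part of ACME. The import is deferred into the helper so that the settings can be inspected without contacting the scheduler.

```python
from typing import Any, Dict

# Hypothetical settings; "my_partition" is a placeholder partition name.
cluster_settings: Dict[str, Any] = {
    "partition": "my_partition",
    "n_cores": 2,
    "n_workers": 10,
    "mem_per_worker": "2GB",
    "n_workers_startup": 5,   # return as soon as 5 SLURM jobs are running ...
    "timeout": 120,           # ... or once 120 seconds have passed
    "interactive": False,     # do not prompt for input, e.g. in batch scripts
    "start_client": True,     # ask for a client rather than the bare cluster
}


def start_slurm_cluster(settings: Dict[str, Any]) -> Any:
    """Forward `settings` to acme.slurm_cluster_setup (needs ACME and SLURM)."""
    from acme import slurm_cluster_setup  # deferred: only needed on the cluster

    return slurm_cluster_setup(**settings)
```

Calling start_slurm_cluster(cluster_settings) on a login node would then request ten workers and return once five of them are up or the timeout has expired.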
- Returns:
proc – A distributed computing client (if start_client = True) or a distributed computing cluster (otherwise). If no SLURM workers can be started within the given timeout interval, proc is set to None.
- Return type:
object or None
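Because proc can be a client, a cluster, or None, calling code may want to normalize it before submitting work. The helper below is a hedged sketch of one way to do so: as_client is a hypothetical name, and the choice to raise a RuntimeError for None is an assumption of this example rather than ACME behavior. Attaching a client to an existing cluster uses distributed.Client, which accepts a cluster object as its first argument.

```python
from typing import Any, Optional


def as_client(proc: Optional[Any], start_client: bool) -> Any:
    """Return a distributed client for whatever slurm_cluster_setup returned.

    `start_client` must be the value that was passed to slurm_cluster_setup.
    """
    if proc is None:
        # No SLURM worker came up within the timeout interval
        raise RuntimeError("no SLURM workers could be started")
    if start_client:
        # proc already is a client
        return proc
    # proc is a bare cluster: attach a client to it
    from distributed import Client  # deferred: only needed in this branch

    return Client(proc)
```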
See also
dask_jobqueue.SLURMCluster – launch a dask cluster of SLURM workers
esi_cluster_setup – start a SLURM worker cluster on the ESI HPC infrastructure
bic_cluster_setup – start a SLURM worker cluster on the CoBIC HPC infrastructure
local_cluster_setup – start a local Dask multi-processing cluster on the host machine
cluster_cleanup – remove dangling parallel processing worker-clusters
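Since workers that are never shut down keep occupying SLURM resources, setup and cleanup are often paired. The context manager below is a sketch of that pattern under stated assumptions: managed_cluster is a hypothetical helper, and the setup and cleanup routines are passed in as arguments so that the example does not depend on a particular signature of cluster_cleanup.

```python
from contextlib import contextmanager
from typing import Any, Callable, Iterator


@contextmanager
def managed_cluster(
    setup: Callable[..., Any],
    cleanup: Callable[[Any], None],
    **settings: Any,
) -> Iterator[Any]:
    """Start workers via `setup` and always hand them to `cleanup` afterwards."""
    proc = setup(**settings)
    try:
        yield proc
    finally:
        # Nothing to clean up if no worker could be started (proc is None)
        if proc is not None:
            cleanup(proc)
```

Assuming that cluster_cleanup accepts the client as its first argument, this could be used as `with managed_cluster(acme.slurm_cluster_setup, acme.cluster_cleanup, partition="my_partition") as client: ...`, where "my_partition" is a placeholder.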