esi_cluster_setup

acme.esi_cluster_setup(partition: str, n_workers: int = 2, mem_per_worker: str = 'auto', cores_per_worker: Optional[int] = None, n_workers_startup: int = 1, timeout: int = 60, interactive: bool = True, interactive_wait: int = 120, start_client: bool = True, job_extra: List = [], **kwargs: Optional[Any]) → Union[Client, SLURMCluster, LocalCluster]

Start a Dask distributed SLURM worker cluster on the ESI HPC infrastructure (or local multi-processing)

Parameters:
  • partition (str) – Name of SLURM partition/queue to start workers in. Use the command sinfo in the terminal to see a list of available SLURM partitions on the ESI HPC cluster.

  • n_workers (int) – Number of SLURM workers (=jobs) to spawn

  • mem_per_worker (str) – Memory booking for each worker. Can be specified either in megabytes (e.g., mem_per_worker = "1500MB") or gigabytes (e.g., mem_per_worker = "2GB"). If mem_per_worker is "auto", a sensible default is inferred from the chosen partition, e.g., for partition = "8GBS" mem_per_worker is automatically set to the allowed maximum of "8GB". On the IBM POWER partition "E880", mem_per_worker is set to 16 GB if not provided. Note that even in queues with guaranteed memory bookings, it is possible to allocate less memory than the allowed maximum per worker in order to spawn numerous low-memory workers. See Examples for details.

  • cores_per_worker (None or int) – Number of CPU cores allocated for each worker. If None, the core count is derived from the partition settings (DefMemPerCPU) and the CPU architecture (minimum of 1 on x86_64 and 4 on IBM POWER).

  • n_workers_startup (int) – Number of spawned workers to wait for. If n_workers_startup is 1 (default), the code does not proceed until either 1 SLURM job is running or the timeout interval has been exceeded.

  • timeout (int) – Number of seconds to wait for requested workers to start (see n_workers_startup).

  • interactive (bool) – If True, user input is queried in case fewer workers than requested via n_workers_startup could be started within the waiting period set by timeout. The code waits interactive_wait seconds for a user choice; if none is provided, it continues with the current number of running workers (if greater than zero). If interactive is False and no worker could be started within timeout seconds, a TimeoutError is raised.

  • interactive_wait (int) – Countdown interval (seconds) to wait for a user response in case fewer than n_workers_startup workers could be started. If no choice is provided within the given time, the code automatically proceeds with the current number of active dask workers.

  • start_client (bool) – If True, a distributed computing client is launched and attached to the dask worker cluster. If start_client is False, only a distributed computing cluster is started to which compute-clients can connect.

  • job_extra (list) – Extra sbatch parameters to pass to SLURMCluster.

  • **kwargs (dict) – Additional keyword arguments can be used to control job-submission details.

Returns:

proc – A distributed computing client (if start_client = True) or a distributed computing cluster (otherwise).

Return type:

object

Examples

The following command launches 10 SLURM workers with 2 gigabytes of memory each in the 8GBS partition

>>> client = esi_cluster_setup(n_workers=10, partition="8GBS", mem_per_worker="2GB")
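
Extra sbatch options can be supplied via job_extra. The flag shown below is only a hypothetical illustration; any valid sbatch argument may be passed this way

>>> client = esi_cluster_setup(n_workers=4, partition="8GBS",
...                            job_extra=["--job-name=acme_workers"])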

Use default settings to start 2 SLURM workers in the IBM POWER E880 partition (allocating 4 cores and 16 GB memory per worker)

>>> client = esi_cluster_setup(partition="E880")

The underlying distributed computing cluster can be accessed using

>>> client.cluster
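
If start_client is False, only the cluster object is returned. A client can then be attached manually; a minimal sketch using the standard dask.distributed API

>>> cluster = esi_cluster_setup(partition="8GBS", start_client=False)
>>> from dask.distributed import Client
>>> client = Client(cluster)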

Notes

The employed parallel computing engine relies on the concurrent processing library Dask. Thus, the distributed computing clients generated here are in fact instances of distributed.Client. This function specifically acts as a wrapper for dask_jobqueue.SLURMCluster. Users familiar with Dask in general, and its distributed scheduler and cluster objects in particular, can leverage Dask's entire API to fine-tune parallel processing jobs to their liking.
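
For instance, once a client is attached, work can be farmed out to the SLURM workers directly through the standard distributed API; a minimal sketch in which square is a placeholder function

>>> def square(x):
...     return x ** 2
>>> futures = client.map(square, range(4))
>>> client.gather(futures)
[0, 1, 4, 9]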

See also

dask_jobqueue.SLURMCluster

launch a dask cluster of SLURM workers

slurm_cluster_setup

start a distributed Dask cluster of parallel processing workers using SLURM

local_cluster_setup

start a local Dask multi-processing cluster on the host machine

cluster_cleanup

remove dangling parallel processing worker-clusters