Troubleshooting + FAQ

ACME is intended solely for executing user-provided functions multiple times in parallel. Thus, only problems that can be split into independent tasks can be processed with ParallelMap (“embarrassingly parallel workloads”). Problems that require inter-process communication, worker synchronization, or shared memory will not work.
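
As a rough illustration of this distinction, the following sketch contrasts an independent per-item computation with one where every step needs the previous result. It uses only the Python standard library; the built-in map is a sequential stand-in here, not ACME itself, and the names square, data, independent, and running_total are hypothetical.

```python
from itertools import accumulate


def square(x):
    # Depends only on its own input, so every call is an independent task
    return x * x


data = [1, 2, 3, 4]

# Independent tasks: each item can be processed by any worker in any order
independent = list(map(square, data))

# Processing the items in reverse order yields the same per-item results
reordered = {x: square(x) for x in reversed(data)}

# Dependent tasks: each running total needs the previous one, so the
# steps cannot be handed out as independent jobs
running_total = list(accumulate(data))
```

A workload shaped like square is a candidate for ParallelMap; one shaped like the running total is not, unless it can be reformulated into independent pieces.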

If Things Go Wrong

First and foremost, ensure that the function you want to execute in parallel runs fine on its own. Consider

import numpy as np

def f(x, y, z=3, w=np.arange(3)):
    return (x.sum() + y) * z * w.max()

Then

>>> f(np.ones((3,)), 4)
42.0

works as expected. However,

>>> f([1,1,1], 4)
AttributeError                            Traceback (most recent call last)
Input In [10], in <cell line: 1>()
----> 1 f([1,1,1], 4)

Input In [8], in f(x, y, z, w)
    1 def f(x, y, z=3, w=np.arange(3)):
----> 2     return (x.sum() + y) * z * w.max()

AttributeError: 'list' object has no attribute 'sum'

fails because the Python list [1,1,1] has no .sum() method, unlike a NumPy array. Thus, invoking ParallelMap like this:

with ParallelMap(f, [[1, 1, 1], [2, 2, 2]], 4) as pmap:
    results = pmap.compute()

will similarly cause the spawned distributed workers to crash with Compute Failed errors. The root problem is unrelated to the parallel execution of f; it is caused by passing the wrong input type for x. Nail It Before You Scale It.
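
One way to put this advice into practice is a sequential dry run over the same inputs you intend to hand to ParallelMap. The helper dry_run below is a hypothetical sketch, not part of ACME; it assumes the function receives each input element as its first argument followed by the shared arguments, mirroring the call above.

```python
import numpy as np


def f(x, y, z=3, w=np.arange(3)):
    return (x.sum() + y) * z * w.max()


def dry_run(func, inputs, *args, **kwargs):
    """Hypothetical helper: call func on each input sequentially.

    Returns a list of results and a list of (index, error) pairs.
    """
    results, failures = [], []
    for idx, item in enumerate(inputs):
        try:
            results.append(func(item, *args, **kwargs))
        except Exception as exc:
            failures.append((idx, repr(exc)))
    return results, failures


raw_inputs = [[1, 1, 1], [2, 2, 2]]

# Plain lists reproduce the AttributeError without starting any workers
_, failures = dry_run(f, raw_inputs, 4)

# Converting each element to a NumPy array first lets f run cleanly
array_inputs = [np.asarray(item) for item in raw_inputs]
results, no_failures = dry_run(f, array_inputs, 4)
```

If the list of failures is non-empty, the reported index points at the offending input, which is usually quicker to debug than a crashed worker.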

Once you have ensured that your function works fine in a sequential setting, you can try increasing ACME’s logging verbosity to get a better understanding of what’s happening under the hood:

with ParallelMap(myfunc, ..., logfile=True, verbose=True) as pmap:
    results = pmap.compute()

If your function works fine, but you think something’s wrong with ACME, please let us know by opening a bug report in our GitHub Issue Tracker.

FAQ

Q: My function runs fine sequentially, but with ACME I get lots of messages like “distributed.core - INFO - Event loop was unresponsive in Worker for 3.34s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.”

This may happen if the function you’re wrapping with ParallelMap is defined in the same script in which you invoke ParallelMap. See Best Practices for a strategy to circumvent this problem.

Q: I try to run ACME locally on my machine but I always get a RuntimeError

If you call ParallelMap (or local_cluster_setup()) inside a script that does not contain an if __name__ == "__main__" block, starting parallel workers triggers an infinite recursion: new processes are started before the calling process has finished its bootstrapping phase. Thus, try wrapping ParallelMap (or local_cluster_setup()) inside a main module block, i.e.,

if __name__ == "__main__":
    ...
    local_cluster_setup()

    with ParallelMap(...) ...
    ...

If you still encounter problems, try migrating the function you are ParallelMap-ping to a separate .py file (see Best Practices).

Q: I started a parallel computing client manually - how can I shut it down?

You can simply call the client’s .close() method (e.g., myclient.close()) or use the ACME convenience function cluster_cleanup().

ESI-HPC Cluster Specifics

To make optimal use of the ESI HPC infrastructure, please make sure to first pick the right SLURM partition for your task (ACME’s default falls back to our “smallest” partition 8GBXS).

Instead of letting ACME automatically spawn and kill SLURM jobs, you can use the helper function esi_cluster_setup() to start a parallel computing client using a set number of SLURM jobs (=workers). This client can be re-used across several invocations of ParallelMap. Specifically, if you plan to call ParallelMap multiple times in your analysis script, overall runtime performance can be greatly improved by creating a client once at the beginning and subsequently “recycling” it for every ParallelMap call (otherwise ACME spawns and kills workers every time you invoke ParallelMap). In practice, a single line at the top of your script is sufficient to put this strategy into action (ACME picks up any existing client automatically; you don’t have to tell it beforehand):

from acme import ParallelMap, esi_cluster_setup, cluster_cleanup

if __name__ == "__main__":
    myclient = esi_cluster_setup(partition="16GBXL", n_workers=10)

    ...
    with ParallelMap(...) as pmap:
        pmap.compute()

    ...
    with ParallelMap(...) as pmap:
        pmap.compute()

    ...
    with ParallelMap(...) as pmap:
        pmap.compute()

    cluster_cleanup(myclient)