
Dask Cluster fails read_csv only when remote dask-workers join

I'm new to data science and Python (learning fast; engineer by trade, geek at heart). I know this looks like a bug, but there might be ways around it, so I'll take any ideas from the wild.

I have instantiated a Dask LocalCluster and Client on my local machine: Ryzen 3800X with 32GB RAM.

from dask.distributed import Client, LocalCluster
daskcluster = LocalCluster(host='0.0.0.0')
daskclient = Client(daskcluster)
daskclient

Scheduler: tcp://192.168.1.152:62020

Dashboard: http://192.168.1.152:8787/status

Then I'm trying to read in the data set. Pandas and Dask can quite happily read the 25GB directory of 446 *.csv files (hours of processing).

%%time
import dask.dataframe as dd

df = dd.read_csv(origPathFile) # Yeah, a while.
df = df.set_index("Date (UTC)") # expect almost 5 minutes
df = df.drop_duplicates() ## HOURS
df = df.repartition(npartitions=600) ## new - yet to time
df.to_parquet(outpathfile) # this is the line which commits and computes all of the above

When I let that run on the local cluster's 4 workers (8 cores, 16 threads): great, no problem. But I'm doing this for the learning, right? I have a Mac Mini with 8GB and another Ryzen 3600 with a further 32GB of RAM here, and RAM seems to be my bottleneck.

As soon as I bring up an Anaconda Prompt on the other Ryzen, or a Terminal on the Mac Mini, and start a worker to join the fray:

dask-worker --memory-limit 10GB 192.168.1.152:62020

I get the error message:

FileNotFoundError                         Traceback (most recent call last)
<timed exec> in <module>

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in set_index(***failed resolving arguments***)
   3615                 npartitions=npartitions,
   3616                 divisions=divisions,
-> 3617                 **kwargs
   3618             )
   3619

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\shuffle.py
in set_index(df, index, npartitions, shuffle, compute, drop, upsample,
divisions, partition_size, **kwargs)
     83         sizes, mins, maxes = base.optimize(sizes, mins, maxes)
     84         divisions, sizes, mins, maxes = base.compute(
---> 85             divisions, sizes, mins, maxes, optimize_graph=False
     86         )
     87         divisions = divisions.tolist()

C:\ProgramData\Anaconda3\lib\site-packages\dask\base.py in
compute(*args, **kwargs)
    442         postcomputes.append(x.__dask_postcompute__())
    443 
--> 444     results = schedule(dsk, keys, **kwargs)
    445     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    446 

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2664                     should_rejoin = False
   2665             try:
-> 2666                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2667             finally:
   2668                 for f in futures.values():

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in gather(self, futures, errors, direct, asynchronous)
   1965                 direct=direct,
   1966                 local_worker=local_worker,
-> 1967                 asynchronous=asynchronous,
   1968             )
   1969

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in
sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    814         else:
    815             return sync(
--> 816                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    817             )
    818 

C:\ProgramData\Anaconda3\lib\site-packages\distributed\utils.py in
sync(loop, func, callback_timeout, *args, **kwargs)
    345     if error[0]:
    346         typ, exc, tb = error[0]
--> 347         raise exc.with_traceback(tb)
    348     else:
    349         return result[0]

C:\ProgramData\Anaconda3\lib\site-packages\distributed\utils.py in f()
    329             if callback_timeout is not None:
    330                 future = asyncio.wait_for(future, callback_timeout)
--> 331             result[0] = yield future
    332         except Exception as exc:
    333             error[0] = sys.exc_info()

C:\ProgramData\Anaconda3\lib\site-packages\tornado\gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in _gather(self, futures, errors, direct, local_worker)
   1824                             exc = CancelledError(key)
   1825                         else:
-> 1826                             raise exception.with_traceback(traceback)
   1827                         raise exc
   1828                     if errors == "skip":

/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/dask/bytes/core.py
in read_block_from_file()

/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/core.py
in __enter__()

/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/spec.py
in open()

/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/implementations/local.py
in _open()

/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/implementations/local.py
in __init__()

/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/implementations/local.py
in _open()

FileNotFoundError: [Errno 2] No such file or directory:
'c:/Users/username/Python/data/origData/data-mtm-ss-wtt-2020-03-02-19-54-00.csv'

But it was WORKING before. It's as though the remote worker is looking on its own C:\ drive. I get the same error when one of the other computers runs the LocalCluster and I enrol workers with it: whenever workers from multiple machines are enrolled on a single LocalCluster, read_csv fails.
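A quick check along these lines (just a sketch, using the same daskclient from above and the hard-coded path from the traceback) is to ask every connected worker whether it can actually see the data directory:

import os

# Returns a dict keyed by worker address; remote machines should come back
# False for a path that only exists on this PC.
daskclient.run(os.path.exists, 'c:/Users/username/Python/data/origData')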

I have a NAS and can create an FTP server on it... nope, a different error, whether I have remote workers or not (and I had to work around the account login first).

%%time 
df = dd.read_csv('ftp://nas.local/PythonData/origData/*.csv') 
df = df.set_index("Date (UTC)") # expect almost 120 seconds
df = df.drop_duplicates() ## HOURS
df = df.repartition(npartitions=600)
df.to_parquet(outpathfile)

Yay for about 3 seconds then...

KilledWorker                              Traceback (most recent call last)
<timed exec> in <module>

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in set_index(***failed resolving arguments***)
   3615                 npartitions=npartitions,
   3616                 divisions=divisions,
-> 3617                 **kwargs
   3618             )
   3619

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\shuffle.py
in set_index(df, index, npartitions, shuffle, compute, drop, upsample,
divisions, partition_size, **kwargs)
     83         sizes, mins, maxes = base.optimize(sizes, mins, maxes)
     84         divisions, sizes, mins, maxes = base.compute(
---> 85             divisions, sizes, mins, maxes, optimize_graph=False
     86         )
     87         divisions = divisions.tolist()

C:\ProgramData\Anaconda3\lib\site-packages\dask\base.py in
compute(*args, **kwargs)
    442         postcomputes.append(x.__dask_postcompute__())
    443 
--> 444     results = schedule(dsk, keys, **kwargs)
    445     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    446 

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2664                     should_rejoin = False
   2665             try:
-> 2666                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2667             finally:
   2668                 for f in futures.values():

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in gather(self, futures, errors, direct, asynchronous)
   1965                 direct=direct,
   1966                 local_worker=local_worker,
-> 1967                 asynchronous=asynchronous,
   1968             )
   1969

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in
sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    814         else:
    815             return sync(
--> 816                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    817             )
    818 

C:\ProgramData\Anaconda3\lib\site-packages\distributed\utils.py in
sync(loop, func, callback_timeout, *args, **kwargs)
    345     if error[0]:
    346         typ, exc, tb = error[0]
--> 347         raise exc.with_traceback(tb)
    348     else:
    349         return result[0]

C:\ProgramData\Anaconda3\lib\site-packages\distributed\utils.py in f()
    329             if callback_timeout is not None:
    330                 future = asyncio.wait_for(future, callback_timeout)
--> 331             result[0] = yield future
    332         except Exception as exc:
    333             error[0] = sys.exc_info()

C:\ProgramData\Anaconda3\lib\site-packages\tornado\gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in _gather(self, futures, errors, direct, local_worker)
   1824                             exc = CancelledError(key)
   1825                         else:
-> 1826                             raise exception.with_traceback(traceback)
   1827                         raise exc
   1828                     if errors == "skip":

KilledWorker:
("('from-delayed-pandas_read_text-read-block-getitem-63774c26477d7b369d337b87bd2c5520',
587)", <Worker 'tcp://192.168.1.152:62050', name: 0, memory: 0,
processing: 381>)

I'm doing this for fun and to learn the limitations. I would dearly love to have the remote PC helping out, but it falls over every time a remote worker tries to read a CSV, and the whole point of getting into Dask is that it doesn't have to read every file at the same time and then hold them all in RAM at the SAME time... I know another 32GB of RAM is only $300, but I'm trying to learn the limits of the tool this way.

I have already coded the program once in Pandas. I have to run rolling windows across this 216M-line data set once I have the right columns in the right units, so the distributed nature of Dask is terribly fitting, and this data set is only 6 hours out of a two-month data set. I want this to work on this "small" subset so that I can extrapolate to bigger and better things. It runs for 35 minutes, then sucks the whole memory and pagefile dry and locks up the system.
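Roughly, the rolling step I'm after looks something like this (just a sketch: the column name "Value" and the window size are placeholders, and it assumes "Date (UTC)" has been parsed as a datetime index):

# Rolling windows in Dask mirror pandas once the frame is indexed and
# sorted by time; the names here are illustrative only.
rolled = df['Value'].rolling('5min').mean()
result = rolled.compute()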

Ideally, could Dask be told to have local files read only by local workers, via the program? I only have local machines, no cloud clusters and no big-data web services at my disposal, just a little QNAP NAS and a couple of new Ryzen PCs. Learning Python in self-isolation on the weekends.

Thoughts? Go!

It looks like you have two machines:

- mac mini
- A PC

In the first execution of your workflow, you started to read CSVs from a local file path (C:/...csv), and your other worker does not have that file path. You then tried to use FTP, but this failed immediately.

A few things:

1) Make sure FTP is set up correctly and all machines can access the files correctly.
2) Consult https://filesystem-spec.readthedocs.io/en/latest/usage.html?highlight=ftp#instantiate-a-file-system to see if you need to pass additional options (see the sketch below).
3) If FTP is complicated, you can always use HTTP+python to serve the files with python -m http.server.
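For point 2, fsspec options can be passed straight through dd.read_csv via storage_options; a minimal sketch, assuming the NAS wants a username and password (the values below are placeholders):

import dask.dataframe as dd

# Credentials are handed to fsspec's FTP filesystem by dask.
df = dd.read_csv(
    'ftp://nas.local/PythonData/origData/*.csv',
    storage_options={'username': 'nasuser', 'password': 'secret'},
)

For point 3, run python -m http.server 8000 in the data directory on the machine that holds the CSVs, and every worker can then fetch the same URLs. Globbing is not always available over HTTP, so you may need to pass an explicit list of URLs built from your known file names, e.g. dd.read_csv(['http://192.168.1.152:8000/file-001.csv', ...]).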

You could also play with Resource Restrictions to restrict the data-reading tasks to a subset of workers, for example:
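One rough way to experiment with that (a sketch only: worker resources are declared at worker start-up, "LOCALDATA" is an arbitrary tag name, and one_csv_path is a placeholder for a single file path):

# Workers that can actually see the CSV directory advertise a resource:
#   dask-worker 192.168.1.152:62020 --resources "LOCALDATA=1"

import pandas as pd

# Tasks submitted with a matching requirement are only scheduled on those workers.
future = daskclient.submit(pd.read_csv, one_csv_path, resources={'LOCALDATA': 1})

Applying the same restriction to a whole dask.dataframe collection is possible too, but the exact syntax depends on your dask version; see the distributed "Worker Resources" documentation.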
