I'm new to data science and Python (engineer by trade, geek at heart, learning fast). I know this is arguably a bug, but there may be ways around it, so I'll take any ideas from the wild.
I have instantiated a Dask LocalCluster and Client on my local machine, a Ryzen 3800X with 32GB RAM:
from dask.distributed import Client, LocalCluster
daskcluster = LocalCluster(host='0.0.0.0')
daskclient = Client(daskcluster)
daskclient
Scheduler: tcp://192.168.1.152:62020
Dashboard: http://192.168.1.152:8787/status
Then I'm trying to read in the data set. Pandas and Dask can both quite happily read the 25GB directory of 446 *.csv files (hours of processing):
%%time
df = dd.read_csv(origPathFile) # Yeah, a while.
df = df.set_index("Date (UTC)") # expect almost 5 minutes
df = df.drop_duplicates() ## HOURS
df = df.repartition(npartitions=600) ## new - yet to time
df.to_parquet(outpathfile) # this is the line that commits and computes everything above
When I let that go on the local 4 workers, 8 cores, 16 threads: great, no problem. But I'm doing this for the learning, right? I have a Mac Mini with 8GB of RAM and another Ryzen 3600 with another 32GB here, and RAM seems to be my bottleneck.
As soon as I bring up an Anaconda Prompt on the other Ryzen, or Terminal on the Mac Mini, and start a worker to join the fray:
dask-worker --memory-limit 10GB 192.168.1.152:62020
I get the error message:
FileNotFoundError                         Traceback (most recent call last)
<timed exec> in <module>

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in set_index(***failed resolving arguments***)
   3615     npartitions=npartitions,
   3616     divisions=divisions,
-> 3617     **kwargs
   3618 )
   3619

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\shuffle.py in set_index(df, index, npartitions, shuffle, compute, drop, upsample, divisions, partition_size, **kwargs)
     83     sizes, mins, maxes = base.optimize(sizes, mins, maxes)
     84     divisions, sizes, mins, maxes = base.compute(
---> 85         divisions, sizes, mins, maxes, optimize_graph=False
     86     )
     87     divisions = divisions.tolist()

C:\ProgramData\Anaconda3\lib\site-packages\dask\base.py in compute(*args, **kwargs)
    442     postcomputes.append(x.__dask_postcompute__())
    443
--> 444     results = schedule(dsk, keys, **kwargs)
    445     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    446

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2664     should_rejoin = False
   2665 try:
-> 2666     results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2667 finally:
   2668     for f in futures.values():

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in gather(self, futures, errors, direct, asynchronous)
   1965     direct=direct,
   1966     local_worker=local_worker,
-> 1967     asynchronous=asynchronous,
   1968 )
   1969

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    814 else:
    815     return sync(
--> 816         self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    817     )
    818

C:\ProgramData\Anaconda3\lib\site-packages\distributed\utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    345 if error[0]:
    346     typ, exc, tb = error[0]
--> 347     raise exc.with_traceback(tb)
    348 else:
    349     return result[0]

C:\ProgramData\Anaconda3\lib\site-packages\distributed\utils.py in f()
    329     if callback_timeout is not None:
    330         future = asyncio.wait_for(future, callback_timeout)
--> 331     result[0] = yield future
    332 except Exception as exc:
    333     error[0] = sys.exc_info()

C:\ProgramData\Anaconda3\lib\site-packages\tornado\gen.py in run(self)
    733
    734 try:
--> 735     value = future.result()
    736 except Exception:
    737     exc_info = sys.exc_info()

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in _gather(self, futures, errors, direct, local_worker)
   1824     exc = CancelledError(key)
   1825 else:
-> 1826     raise exception.with_traceback(traceback)
   1827     raise exc
   1828 if errors == "skip":

/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/dask/bytes/core.py in read_block_from_file()
/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/core.py in __enter__()
/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/spec.py in open()
/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/implementations/local.py in _open()
/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/implementations/local.py in __init__()
/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/implementations/local.py in _open()

FileNotFoundError: [Errno 2] No such file or directory:
'c:/Users/username/Python/data/origData/data-mtm-ss-wtt-2020-03-02-19-54-00.csv'
But it was WORKING before: it's as though the remote worker is looking for the file on its own drive (note the worker-side frames in the traceback come from the Mac's /Applications/Anaconda install, yet it asks for a c:/Users/... path). I get the same error when one of the other computers runs the LocalCluster and I enroll workers with it; whenever workers from multiple machines are enrolled on a single LocalCluster, read_csv fails.
I have a NAS, so I tried creating an FTP server on it... nope, different error, whether I have remote workers or not, once I got past the account login:
%%time
df = dd.read_csv('ftp://nas.local/PythonData/origData/*.csv')
df = df.set_index("Date (UTC)") # expect almost 120 seconds
df = df.drop_duplicates() ## HOURS
df = df.repartition(npartitions=600)
df.to_parquet(outpathfile)
Yay, for about 3 seconds, then...
KilledWorker                              Traceback (most recent call last)
<timed exec> in <module>

(identical chain of frames to the traceback above: dask\dataframe\core.py set_index → dask\dataframe\shuffle.py set_index → dask\base.py compute → distributed\client.py get → gather → sync → distributed\utils.py → tornado\gen.py run → distributed\client.py _gather, ending at line 1826, raise exception.with_traceback(traceback))

KilledWorker:
("('from-delayed-pandas_read_text-read-block-getitem-63774c26477d7b369d337b87bd2c5520',
587)", <Worker 'tcp://192.168.1.152:62050', name: 0, memory: 0,
processing: 381>)
I'm doing this for fun, learning the limitations. I would dearly love to have the remote PCs helping out, but it falls over every time a remote worker reads a CSV, and the whole point of getting into Dask is that it doesn't have to read all the files, and hold them all in RAM, at the SAME time... I know another 32GB of RAM is only $300, but I'm trying to learn the limits like this.
I have already coded the program once in Pandas. I have to run rolling windows across this 216-million-line data set, once I have the right columns in the right units, so the distributed nature of Dask fits terribly well. And this data set is only 6 hours of a two-month capture; I want this to work on the "small" subset so that I can extrapolate to bigger and better things. The Pandas version runs for 35 minutes, then sucks the whole memory and pagefile dry, and locks up the system.
Ideally, Dask would allow local file reads by local workers only, or restrict them via the program? I only have local machines: no web clusters, no web big-data services at my disposal, just a little QNAP NAS and a couple of new Ryzen PCs. Learning Python in self-isolation on the weekends.
Thoughts? Go!
It looks like you have two machines:
- a Mac Mini
- a PC
In the first execution of your workflow, you started to read CSVs from a local file path (C:/...csv), and your other worker does not have that file path. You then tried to use FTP, but this failed immediately.
A few things:
1) Make sure FTP is set up correctly and all machines can access the files.
2) Consult https://filesystem-spec.readthedocs.io/en/latest/usage.html?highlight=ftp#instantiate-a-file-system to see if you need to pass additional options (e.g. credentials) through to the filesystem.
3) If FTP is complicated, you can always serve the files over HTTP with python -m http.server.
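A minimal sketch of option 3, with made-up file names and a loopback address: the same python -m http.server machinery, driven programmatically, with pandas standing in for Dask's reader (dd.read_csv accepts the same http:// URLs).

```python
# Serve the CSV directory over HTTP so every worker resolves identical
# URLs instead of machine-local paths. Directory and file names are
# invented for the demo.
import functools, http.server, os, socketserver, tempfile, threading
import pandas as pd  # dask's dd.read_csv accepts the same http:// URLs

datadir = tempfile.mkdtemp()
with open(os.path.join(datadir, "sample.csv"), "w") as f:
    f.write("Date (UTC),value\n2020-03-02 19:54:00,1.0\n")

# Equivalent to running `python -m http.server` inside the data directory.
handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                            directory=datadir)
server = socketserver.TCPServer(("127.0.0.1", 0), handler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Every machine on the LAN could now read the same URL; on the real
# cluster that would be something like
# dd.read_csv("http://192.168.1.152:8000/*.csv")
df = pd.read_csv(f"http://127.0.0.1:{port}/sample.csv")
print(df.shape)  # (1, 2)
server.shutdown()
```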
You could also play with Resource Restrictions to dedicate a subset of workers to reading in the data.
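A hedged sketch of that idea, using a made-up resource name "FILE": on the real cluster you would start only the file-hosting machine's workers with dask-worker 192.168.1.152:62020 --resources "FILE=1", then require that resource for the read tasks. The demo below uses an in-process LocalCluster so it runs anywhere.

```python
# Declare a custom resource on the workers that can actually see the
# files, then restrict tasks to workers holding that resource.
from dask.distributed import Client, LocalCluster

# In-process stand-in for a worker started with --resources "FILE=1".
cluster = LocalCluster(processes=False, n_workers=1, threads_per_worker=1,
                       resources={"FILE": 1})
client = Client(cluster)

# Tasks submitted with resources={"FILE": 1} may only run on workers
# that declared that resource, e.g. the machine holding the CSVs.
future = client.submit(lambda path: "read " + path,
                       "data-2020-03-02.csv", resources={"FILE": 1})
result = future.result()
print(result)  # read data-2020-03-02.csv

client.close()
cluster.close()
```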