
Dask Cluster fails read_csv only when remote dask-workers join

I'm new to data science/Python, learning fast, engineer by trade, geek at heart. I know this is a bug, but there might be ways around it; I'll accept any ideas from the wild.

I have instantiated a Dask LocalCluster and Client on my local machine: a Ryzen 3800X with 32GB RAM.

from dask.distributed import Client, LocalCluster
daskcluster = LocalCluster(host='0.0.0.0')
daskclient = Client(daskcluster)
daskclient

Scheduler: tcp://192.168.1.152:62020

Dashboard: http://192.168.1.152:8787/status

Then I'm trying to read in the data set. Pandas and Dask can both quite happily read the 25GB directory of 446 *.csv files (hours of processing).

%%time
import dask.dataframe as dd

df = dd.read_csv(origPathFile)          # origPathFile: glob path to the 446 CSVs. Yeah, a while.
df = df.set_index("Date (UTC)")         # expect almost 5 minutes
df = df.drop_duplicates()               # HOURS
df = df.repartition(npartitions=600)    # new - yet to time
df.to_parquet(outpathfile)              # this is the line which commits and computes all of the above.

When I let that run on my local 4 workers (8 cores, 16 threads), great, no problem. But I'm doing this for the learning, right? I have a Mac Mini with 8GB and another Ryzen 3600 with another 32GB of RAM here, and RAM seems to be my bottleneck.

As soon as I bring up an Anaconda Prompt on the other Ryzen, or a Terminal on the Mac Mini, and start a worker to join the fray:

dask-worker --memory-limit 10GB 192.168.1.152:62020

I get the error message:

FileNotFoundError                         Traceback (most recent call last)
<timed exec> in <module>

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in set_index(***failed resolving arguments***)
   3615                 npartitions=npartitions,
   3616                 divisions=divisions,
-> 3617                 **kwargs
   3618             )
   3619

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\shuffle.py
in set_index(df, index, npartitions, shuffle, compute, drop, upsample,
divisions, partition_size, **kwargs)
     83         sizes, mins, maxes = base.optimize(sizes, mins, maxes)
     84         divisions, sizes, mins, maxes = base.compute(
---> 85             divisions, sizes, mins, maxes, optimize_graph=False
     86         )
     87         divisions = divisions.tolist()

C:\ProgramData\Anaconda3\lib\site-packages\dask\base.py in
compute(*args, **kwargs)
    442         postcomputes.append(x.__dask_postcompute__())
    443 
--> 444     results = schedule(dsk, keys, **kwargs)
    445     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    446 

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2664                     should_rejoin = False
   2665             try:
-> 2666                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2667             finally:
   2668                 for f in futures.values():

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in gather(self, futures, errors, direct, asynchronous)
   1965                 direct=direct,
   1966                 local_worker=local_worker,
-> 1967                 asynchronous=asynchronous,
   1968             )
   1969

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in
sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    814         else:
    815             return sync(
--> 816                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    817             )
    818 

C:\ProgramData\Anaconda3\lib\site-packages\distributed\utils.py in
sync(loop, func, callback_timeout, *args, **kwargs)
    345     if error[0]:
    346         typ, exc, tb = error[0]
--> 347         raise exc.with_traceback(tb)
    348     else:
    349         return result[0]

C:\ProgramData\Anaconda3\lib\site-packages\distributed\utils.py in f()
    329             if callback_timeout is not None:
    330                 future = asyncio.wait_for(future, callback_timeout)
--> 331             result[0] = yield future
    332         except Exception as exc:
    333             error[0] = sys.exc_info()

C:\ProgramData\Anaconda3\lib\site-packages\tornado\gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in _gather(self, futures, errors, direct, local_worker)
   1824                             exc = CancelledError(key)
   1825                         else:
-> 1826                             raise exception.with_traceback(traceback)
   1827                         raise exc
   1828                     if errors == "skip":

/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/dask/bytes/core.py
in read_block_from_file()

/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/core.py
in __enter__()

/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/spec.py
in open()

/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/implementations/local.py
in _open()

/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/implementations/local.py
in __init__()

/Applications/Anaconda/anaconda3/lib/python3.7/site-packages/fsspec/implementations/local.py
in _open()

FileNotFoundError: [Errno 2] No such file or directory:
'c:/Users/username/Python/data/origData/data-mtm-ss-wtt-2020-03-02-19-54-00.csv'

But it was WORKING before: it's as though the remote worker is looking on its own C:\ drive. I see the same error when one of the other computers runs the LocalCluster and I start workers that enroll with it; whenever workers from multiple machines are enrolled on a single LocalCluster, read_csv fails.

I have a NAS, so I can create an FTP server on it... nope, a different error, whether I have remote workers or not (and I had to work around the account login).

%%time
df = dd.read_csv('ftp://nas.local/PythonData/origData/*.csv')
df = df.set_index("Date (UTC)")         # expect almost 120 seconds
df = df.drop_duplicates()               # HOURS
df = df.repartition(npartitions=600)
df.to_parquet(outpathfile)

Yay, for about 3 seconds, then...

KilledWorker                              Traceback (most recent call last)
<timed exec> in <module>

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\core.py in set_index(***failed resolving arguments***)
   3615                 npartitions=npartitions,
   3616                 divisions=divisions,
-> 3617                 **kwargs
   3618             )
   3619

C:\ProgramData\Anaconda3\lib\site-packages\dask\dataframe\shuffle.py
in set_index(df, index, npartitions, shuffle, compute, drop, upsample,
divisions, partition_size, **kwargs)
     83         sizes, mins, maxes = base.optimize(sizes, mins, maxes)
     84         divisions, sizes, mins, maxes = base.compute(
---> 85             divisions, sizes, mins, maxes, optimize_graph=False
     86         )
     87         divisions = divisions.tolist()

C:\ProgramData\Anaconda3\lib\site-packages\dask\base.py in
compute(*args, **kwargs)
    442         postcomputes.append(x.__dask_postcompute__())
    443 
--> 444     results = schedule(dsk, keys, **kwargs)
    445     return repack([f(r, *a) for r, (f, a) in zip(results, postcomputes)])
    446 

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in get(self, dsk, keys, restrictions, loose_restrictions, resources, sync, asynchronous, direct, retries, priority, fifo_timeout, actors, **kwargs)
   2664                     should_rejoin = False
   2665             try:
-> 2666                 results = self.gather(packed, asynchronous=asynchronous, direct=direct)
   2667             finally:
   2668                 for f in futures.values():

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in gather(self, futures, errors, direct, asynchronous)
   1965                 direct=direct,
   1966                 local_worker=local_worker,
-> 1967                 asynchronous=asynchronous,
   1968             )
   1969

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in
sync(self, func, asynchronous, callback_timeout, *args, **kwargs)
    814         else:
    815             return sync(
--> 816                 self.loop, func, *args, callback_timeout=callback_timeout, **kwargs
    817             )
    818 

C:\ProgramData\Anaconda3\lib\site-packages\distributed\utils.py in
sync(loop, func, callback_timeout, *args, **kwargs)
    345     if error[0]:
    346         typ, exc, tb = error[0]
--> 347         raise exc.with_traceback(tb)
    348     else:
    349         return result[0]

C:\ProgramData\Anaconda3\lib\site-packages\distributed\utils.py in f()
    329             if callback_timeout is not None:
    330                 future = asyncio.wait_for(future, callback_timeout)
--> 331             result[0] = yield future
    332         except Exception as exc:
    333             error[0] = sys.exc_info()

C:\ProgramData\Anaconda3\lib\site-packages\tornado\gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

C:\ProgramData\Anaconda3\lib\site-packages\distributed\client.py in _gather(self, futures, errors, direct, local_worker)
   1824                             exc = CancelledError(key)
   1825                         else:
-> 1826                             raise exception.with_traceback(traceback)
   1827                         raise exc
   1828                     if errors == "skip":

KilledWorker:
("('from-delayed-pandas_read_text-read-block-getitem-63774c26477d7b369d337b87bd2c5520',
587)", <Worker 'tcp://192.168.1.152:62050', name: 0, memory: 0,
processing: 381>)

I'm doing this for fun, to learn the limitations, and I would dearly love to have the remote PC helping out. But the whole point of getting into Dask is that it doesn't have to read every CSV at the same time and then hold them all in RAM at the SAME time... I know another 32GB of RAM is only $300, but I'm trying to learn the limits of a setup like this.

I have already coded this program once in Pandas. I have to run rolling windows across this 216-million-line data set, once I have the right columns in the right units, so the distributed nature of Dask is terribly fitting. And this data set is only 6 hours out of a two-month data set. I want this to work on this "small" subset, so that I can extrapolate to bigger and better things. In Pandas it runs for 35 minutes, then sucks the whole memory and pagefile dry and locks up the system.

Ideally, could Dask be updated to allow local file reads by local workers only? Or can that be arranged from within the program? I only have local machines at my disposal: no web clusters, no big-data services, just a little QNAP NAS and a couple of new Ryzen PCs. Learning Python in self-isolation on the weekends.

Thoughts? Go!

It looks like you have two machines: a Mac Mini and a PC.

In the first execution of your workflow, you started to read CSVs based on a local file path (C:/...csv), and your other worker does not have that file path. You then tried to use FTP, but this failed immediately.

A few things: 1) Make sure FTP is set up correctly and all machines can access the files correctly. 2) Consult https://filesystem-spec.readthedocs.io/en/latest/usage.html?highlight=ftp#instantiate-a-file-system to see if you need to pass additional options. 3) If FTP is complicated, you can always serve the files over HTTP with python -m http.server. (A sketch of options 2 and 3 follows below.)
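For what it's worth, here is a minimal sketch of options 2 and 3. The FTP credentials, the HTTP port, and the serving host shown here are placeholders, not values from the question, and globbing over an HTTP directory listing depends on your fsspec version; if it doesn't work, pass an explicit list of URLs to read_csv instead.

import dask.dataframe as dd

# Option 2: pass the FTP login through storage_options, which Dask hands to
# fsspec's FTPFileSystem, instead of embedding it in the URL.
df = dd.read_csv(
    "ftp://nas.local/PythonData/origData/*.csv",
    storage_options={"username": "myuser", "password": "mypassword"},  # placeholders
)

# Option 3: on the machine that actually holds the CSVs (Python 3.7+), serve the
# directory over HTTP:
#     python -m http.server 8000 --directory c:/Users/username/Python/data/origData
# Then every worker reads the same URLs, regardless of its local filesystem.
df = dd.read_csv("http://192.168.1.152:8000/*.csv")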

You could also play with resource restrictions to restrict reading of the data to a subset of workers.
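A rough sketch of that idea, assuming the worker on the machine that holds the files is started with a custom resource tag (the name LOCALDATA is arbitrary). How resources= interacts with optimized collection graphs varies between dask/distributed versions, so treat this as a starting point rather than a recipe:

# On the machine that has the CSVs, start its worker with a resource tag:
#     dask-worker 192.168.1.152:62020 --memory-limit 10GB --resources "LOCALDATA=1"

import dask.dataframe as dd
from dask.distributed import Client

client = Client("tcp://192.168.1.152:62020")

df = dd.read_csv("c:/Users/username/Python/data/origData/*.csv")

# Ask the scheduler to run these tasks only on workers advertising LOCALDATA,
# so remote workers never try to open the local path themselves.
df = client.persist(df, resources={"LOCALDATA": 1})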
