dask.distributed: handle serialization of exotic objects?

Context

I am trying to write a data pipeline using dask distributed and some legacy code from a previous project. get_data simply takes url: str and session: ClientSession as arguments and returns a pandas DataFrame.

from dask.distributed import Client
from aiohttp import ClientSession
client = Client()
session: ClientSession = connector.session_factory()

futures = client.map(
    get_data, # function to get data (takes url and http session)
    urls,
    [session for _ in range(len(urls))],  # PROBLEM IS HERE
    retries=5,
)
r = client.map(loader.job, futures)
_ = client.gather(r)

Problem

I get the following error:

 File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/distributed/worker.py", line 2952, in warn_dumps
    b = dumps(obj)
  File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 58, in dumps
    result = cloudpickle.dumps(x, **dump_kwargs)
  File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py", line 632, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle 'TaskStepMethWrapper' object
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f3042b2fa00>

I was then tempted to register a serializer and a deserializer for this exotic object, following this doc:

from typing import Dict, List, Tuple

from distributed.protocol import dask_serialize, dask_deserialize
@dask_serialize.register(TaskStepMethWrapper)
def serialize(ctx: TaskStepMethWrapper) -> Tuple[Dict, List[bytes]]:
    header = {} #?
    frames = [] #?
    return header, frames


@dask_deserialize.register(TaskStepMethWrapper)
def deserialize(header: Dict, frames: List[bytes]) -> TaskStepMethWrapper:
    return TaskStepMethWrapper(frames) #?

The problem is that I don't know where to import TaskStepMethWrapper from. I know that the class TaskStepMethWrapper is asyncio-related:

grep -rnw './' -e '.*TaskStepMethWrapper.*'
grep: ./lib-dynload/_asyncio.cpython-310-x86_64-linux-gnu.so : fichiers binaires correspondent

But I couldn't find its definition anywhere in site-packages/aiohttp. I also tried to use a Client(asynchronous=True), which only resulted in a TypeError: cannot pickle '_contextvars.Context' object.

How do you handle the serialization of exotic objects in dask? Should I extend the dask serializer or use an additional serialization family?

client = Client('tcp://scheduler-address:8786',
                serializers=['dask', 'pickle'], # BUT WHICH ONE
                deserializers=['dask', 'msgpack']) # BUT WHICH ONE

There is a far easier way to get around this: create your sessions within the mapped function. You would have been recreating the sessions in each worker anyway, since they cannot survive a transfer.

from dask.distributed import Client
from aiohttp import ClientSession
client = Client()

def func(u):
    session: ClientSession = connector.session_factory()
    return get_data(u, session)

futures = client.map(
    func,
    urls,
    retries=5,
)

(I don't know what loader.job is, so I have omitted it.)
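To illustrate why the session cannot travel to a worker, here is a minimal stdlib-only sketch. FakeSession is a hypothetical stand-in for aiohttp.ClientSession (it holds a threading.Lock, which, like a live connection or an event-loop handle, cannot be pickled), and plain pickle stands in for the cloudpickle call seen in the traceback. The URLs are made up. The point is only that the session object fails to serialize, while a plain URL does not, so building the session inside the function sidesteps the problem.

```python
import pickle
import threading


class FakeSession:
    """Hypothetical stand-in for aiohttp.ClientSession: owns an unpicklable resource."""

    def __init__(self):
        # Locks (like sockets and event loops) cannot be pickled.
        self._lock = threading.Lock()

    def get(self, url):
        with self._lock:
            return f"payload from {url}"


def can_pickle(obj):
    """Return True if obj survives a pickle round trip, False on a pickling error."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (TypeError, pickle.PicklingError):
        return False


def func(url):
    """Build the session where the work runs, so only the URL has to travel."""
    session = FakeSession()
    return session.get(url)


if __name__ == "__main__":
    print(can_pickle(FakeSession()))           # the session itself cannot be shipped
    print(can_pickle("http://example.com/a"))  # a plain string can
    print(func("http://example.com/a"))
```

With the real library, the body of func would call the question's connector.session_factory() instead of FakeSession(), as in the code above.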

Note that TaskStepMethWrapper (and anything to do with aiohttp) sounds like it should be called only in async code. Maybe func needs to be async, with the appropriate awaits.
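As a rough sketch of what that could look like, assuming get_data can be rewritten as a coroutine: the session is opened and closed inside the coroutine that uses it, and a small synchronous wrapper drives it with asyncio.run. FakeAsyncSession and get_data_async are hypothetical names invented for this illustration (no network access is involved); with aiohttp you would presumably use ClientSession in the async with block instead.

```python
import asyncio


class FakeAsyncSession:
    """Hypothetical stand-in for aiohttp.ClientSession (no network access)."""

    async def __aenter__(self):
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return False  # do not swallow exceptions

    async def get(self, url):
        await asyncio.sleep(0)  # yield to the event loop, as real I/O would
        return f"payload from {url}"


async def get_data_async(url):
    """Hypothetical async variant of get_data: the session lives and dies here."""
    async with FakeAsyncSession() as session:
        return await session.get(url)


def func(url):
    """Synchronous entry point: runs the coroutine on a fresh event loop."""
    return asyncio.run(get_data_async(url))


if __name__ == "__main__":
    print(func("http://example.com/a"))
```

A wrapper like this takes only the URL, so it could be passed to client.map(func, urls, retries=5) in the same way as above. Note that asyncio.run raises a RuntimeError if an event loop is already running in the calling thread, so this assumes the function is executed in a thread without one.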
