
dask.distributed: handle serialization of exotic objects?

Context

I am trying to write a data pipeline using dask.distributed and some legacy code from a previous project. get_data simply takes url: str and session: ClientSession as arguments and returns a pandas DataFrame.
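
Roughly, the assumed shape of the legacy helper (illustrative only, since get_data comes from the previous project) is:

import pandas as pd
from aiohttp import ClientSession

def get_data(url: str, session: ClientSession) -> pd.DataFrame:
    # Fetch the URL via the given HTTP session and return the parsed result.
    ...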

from dask.distributed import Client
from aiohttp import ClientSession
client = Client()
session: ClientSession = connector.session_factory()

futures = client.map(
    get_data, # function to get data (takes url and http session)
    urls,
    [session for _ in range(len(urls))],  # PROBLEM IS HERE
    retries=5,
)
r = client.map(loader.job, futures)
_ = client.gather(r)

Problem

I get the following error:

 File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/distributed/worker.py", line 2952, in warn_dumps
    b = dumps(obj)
  File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 58, in dumps
    result = cloudpickle.dumps(x, **dump_kwargs)
  File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py", line 73, in dumps
    cp.dump(obj)
  File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py", line 632, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle 'TaskStepMethWrapper' object
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f3042b2fa00>

My temptation was then to register a serializer and a deserializer for this exotic object, following this doc:

from typing import Dict, List, Tuple
from distributed.protocol import dask_serialize, dask_deserialize

@dask_serialize.register(TaskStepMethWrapper)
def serialize(ctx: TaskStepMethWrapper) -> Tuple[Dict, List[bytes]]:
    header = {} #?
    frames = [] #?
    return header, frames


@dask_deserialize.register(TaskStepMethWrapper)
def deserialize(header: Dict, frames: List[bytes]) -> TaskStepMethWrapper:
    return TaskStepMethWrapper(frames) #?
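
For reference, a fully worked-out pair of this kind (for a trivial, unrelated class, following the pattern in the Dask serialization docs) fills header with small metadata and frames with the raw bytes needed to rebuild the object:

class Human:
    def __init__(self, name: str):
        self.name = name

@dask_serialize.register(Human)
def serialize_human(human: Human) -> Tuple[Dict, List[bytes]]:
    header = {}                     # small, msgpack-friendly metadata
    frames = [human.name.encode()]  # raw bytes used to rebuild the object
    return header, frames

@dask_deserialize.register(Human)
def deserialize_human(header: Dict, frames: List[bytes]) -> Human:
    return Human(frames[0].decode())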

The problem is that I don't know where to import TaskStepMethWrapper from. I know that the TaskStepMethWrapper class is asyncio-related:

grep -rnw './' -e '.*TaskStepMethWrapper.*'
grep: ./lib-dynload/_asyncio.cpython-310-x86_64-linux-gnu.so: binary file matches

But I couldn't find its definition anywhere in site-packages/aiohttp. I also tried using Client(asynchronous=True), which only resulted in a TypeError: cannot pickle '_contextvars.Context' object.

How do you handle serialization of exotic objects in Dask? Should I extend the Dask serializer or use an additional serialization family?

client = Client('tcp://scheduler-address:8786',
                serializers=['dask', 'pickle'], # BUT WHICH ONE
                deserializers=['dask', 'msgpack']) # BUT WHICH ONE

There is a far easier way to get around this: create your sessions within the mapped function. You would have been recreating the sessions on each worker anyway; they cannot survive a transfer.

from dask.distributed import Client
from aiohttp import ClientSession
client = Client()

def func(u):
    session: ClientSession = connector.session_factory()
    return get_data(u, session)

futures = client.map(
    func,
    urls,
    retries=5,
)

(I don't know what loader.job is, so I have omitted that).

Note that TaskStepMethWrapper (and anything to do with aiohttp) sounds like it should be called only in async code. Maybe func needs to be async and you need the appropriate awaits.
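
If get_data is in fact a coroutine, one way (a sketch, assuming get_data is awaitable and that a plain ClientSession works in place of connector.session_factory()) is to drive the whole fetch inside the mapped function with asyncio.run:

import asyncio
from aiohttp import ClientSession

def func(u):
    async def _fetch(url):
        # The session is created, used, and closed entirely on the worker,
        # so nothing unpicklable ever has to cross the wire.
        async with ClientSession() as session:
            return await get_data(url, session)  # assumes get_data is a coroutine
    return asyncio.run(_fetch(u))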
