I am trying to write a data pipeline using dask distributed and some legacy code from a previous project. get_data simply takes a url: str and a session: ClientSession as arguments and returns a pandas DataFrame.
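For reference, a hypothetical stub of its signature (the body of the legacy code is omitted):

import pandas as pd
from aiohttp import ClientSession

def get_data(url: str, session: ClientSession) -> pd.DataFrame:
    ...  # legacy code: fetch url with session, return a DataFrame

The pipeline itself looks like this: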
from dask.distributed import Client
from aiohttp import ClientSession

client = Client()
session: ClientSession = connector.session_factory()

futures = client.map(
    get_data,  # function to get data (takes url and http session)
    urls,
    [session for _ in range(len(urls))],  # PROBLEM IS HERE
    retries=5,
)
r = client.map(loader.job, futures)
_ = client.gather(r)
I get the following error:
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/distributed/worker.py", line 2952, in warn_dumps
b = dumps(obj)
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/distributed/protocol/pickle.py", line 58, in dumps
result = cloudpickle.dumps(x, **dump_kwargs)
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py", line 73, in dumps
cp.dump(obj)
File "/home/zar3bski/.cache/pypoetry/virtualenvs/poc-dask-iG-N0GH5-py3.10/lib/python3.10/site-packages/cloudpickle/cloudpickle_fast.py", line 632, in dump
return Pickler.dump(self, obj)
TypeError: cannot pickle 'TaskStepMethWrapper' object
Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f3042b2fa00>
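The failure can be reproduced without dask, since a live ClientSession holds loop-bound state that pickle/cloudpickle refuses to serialize. A minimal sketch (the exact error message may vary with the aiohttp version):

import asyncio
import pickle

from aiohttp import ClientSession

async def main():
    async with ClientSession() as session:
        try:
            # roughly the same step dask performs when shipping
            # client.map() arguments to workers
            pickle.dumps(session)
        except TypeError as e:
            print(e)  # e.g. "cannot pickle 'TaskStepMethWrapper' object"

asyncio.run(main())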
My temptation was then to register a serializer and a deserializer for this exotic object, following this doc:
from typing import Dict, List, Tuple

from distributed.protocol import dask_serialize, dask_deserialize

@dask_serialize.register(TaskStepMethWrapper)
def serialize(ctx: TaskStepMethWrapper) -> Tuple[Dict, List[bytes]]:
    header = {}  # ?
    frames = []  # ?
    return header, frames

@dask_deserialize.register(TaskStepMethWrapper)
def deserialize(header: Dict, frames: List[bytes]) -> TaskStepMethWrapper:
    return TaskStepMethWrapper(frames)  # ?
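For comparison, here is what the registration pattern looks like on an object that actually can be reduced to bytes. Endpoint is a hypothetical class invented for illustration:

from typing import Dict, List, Tuple

from distributed.protocol import dask_serialize, dask_deserialize

class Endpoint:  # hypothetical example class, not part of aiohttp or dask
    def __init__(self, url: str):
        self.url = url

@dask_serialize.register(Endpoint)
def serialize_endpoint(ep: Endpoint) -> Tuple[Dict, List[bytes]]:
    header: Dict = {}                  # small msgpack-serializable metadata
    frames = [ep.url.encode("utf-8")]  # payload as a list of bytes-like frames
    return header, frames

@dask_deserialize.register(Endpoint)
def deserialize_endpoint(header: Dict, frames: List[bytes]) -> Endpoint:
    return Endpoint(frames[0].decode("utf-8"))

The catch is that a ClientSession cannot be reduced to frames this way: its state includes an event loop, a connector, and pending tasks, none of which would be meaningful on another worker.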
The problem is that I don't know where to import TaskStepMethWrapper from. I know that TaskStepMethWrapper is asyncio related:
grep -rnw './' -e '.*TaskStepMethWrapper.*'
grep: ./lib-dynload/_asyncio.cpython-310-x86_64-linux-gnu.so: binary file matches
But I couldn't find its definition anywhere in site-packages/aiohttp. I also tried to use a Client(asynchronous=True), which only resulted in a TypeError: cannot pickle '_contextvars.Context' object.
How do you handle serialization of exotic objects in dask? Should I extend the dask serializer or use an additional serialization family?
client = Client('tcp://scheduler-address:8786',
                serializers=['dask', 'pickle'],     # BUT WHICH ONE
                deserializers=['dask', 'msgpack'])  # BUT WHICH ONE
There is a far easier way to get around this: create your sessions within the mapped function. You would have been recreating the sessions on each worker anyway; they cannot survive a transfer.
from dask.distributed import Client
from aiohttp import ClientSession

client = Client()

def func(u):
    session: ClientSession = connector.session_factory()
    return get_data(u, session)

futures = client.map(
    func,
    urls,
    retries=5,
)
(I don't know what loader.job is, so I have omitted that).
Note that TaskStepMethWrapper (and anything to do with aiohttp) sounds like it should be called only in async code. Maybe func needs to be async and you need appropriate awaits.
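A hedged sketch of that async variant, assuming get_data is (or can be made into) a coroutine, and using a bare ClientSession() in place of connector.session_factory() for self-containment. asyncio.run() drives the coroutine to completion inside the plain function that the worker executes:

import asyncio

from aiohttp import ClientSession

async def fetch(u):
    # session is created, used, and closed entirely on the worker
    async with ClientSession() as session:
        return await get_data(u, session)  # assumes get_data is async

def func(u):
    return asyncio.run(fetch(u))

futures = client.map(func, urls, retries=5)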