I have a list I'd like to write out, data
, one file for each item like so:
for i,chunk in enumerate(data):
fname = ROOT / f'{i}.in'
with open(fname, "wb") as fout:
dill.dump(chunk, fout)
Since the data list can be quite long and I'm writing to a network storage location, I'm spending a lot of time waiting for the iteration in NFS back and forth, and I'd like to do this asynchronously if possible.
I have something that basically looks like this now:
import dill
import asyncio
import aiofiles
from pathlib import Path
ROOT = Path("/tmp/")
data = [str(i) for i in range(500)]
def serialize(data):
"""
Write my data out in serial
"""
for i,chunk in enumerate(data):
fname = ROOT / f'{i}.in'
print(fname)
with open(fname, "wb") as fout:
dill.dump(chunk, fout)
def aserialize(data):
"""
Same as above, but writes my data out asynchronously
"""
fnames = [ROOT / f'{i}.in' for i in range(len(data))]
chunks = data
async def write_file(i):
fname = fnames[i]
chunk = chunks[i]
print(fname)
async with aiofiles.open(fname, "wb") as fout:
print(f"written: {i}")
dill.dump(chunk, fout)
await fout.flush()
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.gather(*[write_file(i) for i in range(len(data))]))
Now, when I test the writes, this looks fast enough to be worthwhile on my NFS:
# test 1
start = datetime.utcnow()
serialize(data)
end = datetime.utcnow()
print(end - start)
# >>> 0:02:04.204681
# test 3
start = datetime.utcnow()
aserialize(data)
end = datetime.utcnow()
print(end - start)
# >>> 0:00:27.048893
# faster is better.
But when I actually /de/-serialize the data I wrote, I see that maybe it was fast because it wasn't writing anything:
def deserialize(dat):
tmp = []
for i in range(len(dat)):
fname = ROOT / f'{i}.in'
with open(fname, "rb") as fin:
fo = dill.load(fin)
tmp.append(fo)
return tmp
serialize(data)
d2 = deserialize(data)
d2 == data
# True
Good, whereas:
aserialize(data)
d3 = deserialize(data)
>>> Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 6, in deserialize
File "...python3.7/site-packages/dill/_dill.py", line 305, in load
obj = pik.load()
EOFError: Ran out of input
That is, the asynchronously written files are empty. No wonder it was so fast.
How can I dill/pickle my list into files asynchronously and get them to actually write? I assume I need to await the dill.dump somehow? I thought the fout.flush would handle that, but seems not.
I changed the line dill.dump(chunk, fout)
to await fout.write(dill.dumps(chunk))
and got the data written to the files and deserialized correctly. Seems like dill.dump
works only with regular synchronous files calling file.write
method without await
keyword.
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.