
How do I async pickle a lot of files with aiofiles?

I have a list, data, that I'd like to write out, one file per item, like so:

for i,chunk in enumerate(data):
    fname = ROOT / f'{i}.in'
    with open(fname, "wb") as fout:
        dill.dump(chunk, fout)

Since the data list can be quite long and I'm writing to a network storage location, I spend a lot of time waiting on NFS round trips in each iteration, and I'd like to do this asynchronously if possible.

I have something that basically looks like this now:

import dill
import asyncio
import aiofiles
from datetime import datetime  # used by the timing tests below
from pathlib import Path

ROOT = Path("/tmp/")

data = [str(i) for i in range(500)]

def serialize(data):
  """
  Write my data out in serial
  """
  for i,chunk in enumerate(data):
    fname = ROOT / f'{i}.in'
    print(fname)
    with open(fname, "wb") as fout:
        dill.dump(chunk, fout)

def aserialize(data):
  """
  Same as above, but writes my data out asynchronously
  """
  fnames = [ROOT / f'{i}.in' for i in range(len(data))]
  chunks = data
  async def write_file(i):
    fname = fnames[i]
    chunk = chunks[i]
    print(fname)
    async with aiofiles.open(fname, "wb") as fout:
        print(f"written: {i}")
        dill.dump(chunk, fout)
        await fout.flush()
  loop = asyncio.get_event_loop()
  loop.run_until_complete(asyncio.gather(*[write_file(i) for i in range(len(data))]))

Now, when I test the writes, this looks fast enough to be worthwhile on my NFS:

# test 1
start = datetime.utcnow()
serialize(data)
end = datetime.utcnow()
print(end - start)
# >>> 0:02:04.204681

# test 2
start = datetime.utcnow()
aserialize(data)
end = datetime.utcnow()
print(end - start)
# >>> 0:00:27.048893
# faster is better.

But when I actually /de/-serialize the data I wrote, I see that maybe it was fast because it wasn't writing anything:

def deserialize(dat):
  tmp = []
  for i in range(len(dat)):
    fname = ROOT / f'{i}.in'
    with open(fname, "rb") as fin:
      fo = dill.load(fin)
    tmp.append(fo)
  return tmp

serialize(data)
d2 = deserialize(data)
d2 == data
# True

Good, whereas:

aserialize(data)
d3 = deserialize(data)
>>> Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in deserialize
  File "...python3.7/site-packages/dill/_dill.py", line 305, in load
    obj = pik.load()
EOFError: Ran out of input

That is, the asynchronously written files are empty. No wonder it was so fast.

How can I dill/pickle my list into files asynchronously and have the data actually written? I assume I need to await the dill.dump somehow? I thought fout.flush would handle that, but apparently not.

I changed the line dill.dump(chunk, fout) to await fout.write(dill.dumps(chunk)), and the data was written to the files and deserialized correctly. It seems dill.dump only works with regular synchronous file objects: it calls the file's write method without the await keyword, so with an aiofiles handle the write coroutine is created but never awaited, and nothing reaches the file.

