
How do I async pickle a lot of files with aiofiles?

I have a list, data, that I'd like to write out, one file for each item, like so:

for i,chunk in enumerate(data):
    fname = ROOT / f'{i}.in'
    with open(fname, "wb") as fout:
        dill.dump(chunk, fout)

Since the data list can be quite long and I'm writing to a network storage location, I spend a lot of time waiting on NFS round trips during the iteration, and I'd like to do this asynchronously if possible.

I have something that basically looks like this now:

import dill
import asyncio
import aiofiles
from pathlib import Path

ROOT = Path("/tmp/")

data = [str(i) for i in range(500)]

def serialize(data):
  """
  Write my data out in serial
  """
  for i,chunk in enumerate(data):
    fname = ROOT / f'{i}.in'
    print(fname)
    with open(fname, "wb") as fout:
        dill.dump(chunk, fout)

def aserialize(data):
  """
  Same as above, but writes my data out asynchronously
  """
  fnames = [ROOT / f'{i}.in' for i in range(len(data))]
  chunks = data
  async def write_file(i):
    fname = fnames[i]
    chunk = chunks[i]
    print(fname)
    async with aiofiles.open(fname, "wb") as fout:
        print(f"written: {i}")
        dill.dump(chunk, fout)
        await fout.flush()
  loop = asyncio.get_event_loop()
  loop.run_until_complete(asyncio.gather(*[write_file(i) for i in range(len(data))]))

Now, when I test the writes, this looks fast enough to be worthwhile on my NFS:

from datetime import datetime

# test 1
start = datetime.utcnow()
serialize(data)
end = datetime.utcnow()
print(end - start)
# >>> 0:02:04.204681

# test 2
start = datetime.utcnow()
aserialize(data)
end = datetime.utcnow()
print(end - start)
# >>> 0:00:27.048893
# faster is better.

But when I actually de-serialize the data I wrote, I see that maybe it was only fast because it wasn't writing anything:

def deserialize(dat):
  tmp = []
  for i in range(len(dat)):
    fname = ROOT / f'{i}.in'
    with open(fname, "rb") as fin:
      fo = dill.load(fin)
    tmp.append(fo)
  return tmp

serialize(data)
d2 = deserialize(data)
d2 == data
# True

Good, whereas:

aserialize(data)
d3 = deserialize(data)
>>> Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in deserialize
  File "...python3.7/site-packages/dill/_dill.py", line 305, in load
    obj = pik.load()
EOFError: Ran out of input

That is, the asynchronously written files are empty. No wonder it was so fast.

How can I dill/pickle my list into files asynchronously and get them to actually write? I assume I need to await the dill.dump somehow? I thought fout.flush would handle that, but apparently not.

I changed the line dill.dump(chunk, fout) to await fout.write(dill.dumps(chunk)), and the data was written to the files and deserialized correctly. It seems dill.dump only works with regular synchronous files, because it calls the file's write method without the await keyword, so nothing ever reaches the aiofiles write coroutine.
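For reference, here is a minimal sketch of the corrected writer, assuming the same ROOT and data as in the question. The only change from the question's aserialize is that each chunk is first serialized to bytes with dill.dumps, and those bytes are then awaited through aiofiles' write coroutine:

import dill
import asyncio
import aiofiles
from pathlib import Path

ROOT = Path("/tmp/")

def aserialize(data):
    """Write each chunk to its own file asynchronously."""
    async def write_file(i, chunk):
        fname = ROOT / f'{i}.in'
        async with aiofiles.open(fname, "wb") as fout:
            # dill.dumps returns bytes, which the aiofiles write coroutine accepts
            await fout.write(dill.dumps(chunk))
    loop = asyncio.get_event_loop()
    loop.run_until_complete(
        asyncio.gather(*[write_file(i, chunk) for i, chunk in enumerate(data)])
    )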
