Why is reading and calling an API from a file slower using Python async than synchronously?
I have a large file, with a JSON record on each line. I'm writing a script to upload a subset of these records to CouchDB via its API, and experimenting with different approaches to see which works fastest. Here's what I've found, from fastest to slowest (on a CouchDB instance on my localhost):
1. Read each needed record into memory. After all records are in memory, generate an upload coroutine for each record, and gather/run all the coroutines at once.
2. Synchronously read the file, and when a needed record is encountered, synchronously upload it.
3. Use aiofiles to read the file, and when a needed record is encountered, asynchronously upload it.
Approach #1 is much faster than the other two (about twice as fast). I am confused why approach #2 is faster than #3, especially in contrast to this example here, which takes half as much time to run asynchronously as synchronously (sync code not provided, so I had to rewrite it myself). Is it the context switching from file I/O to HTTP I/O, especially with file reads occurring much more often than API uploads?
For additional illustration, here's some Python pseudo-code that represents each approach:
# Approach 1: read every needed record into memory, then upload them all concurrently.
import json
import asyncio
import aiohttp

records = []
with open('records.txt', 'r') as record_file:
    for line in record_file:
        record = json.loads(line)
        if valid(record):
            records.append(record)

async def batch_upload(records):
    async with aiohttp.ClientSession() as session:
        tasks = []
        for record in records:
            task = async_upload(record, session)
            tasks.append(task)
        await asyncio.gather(*tasks)

asyncio.run(batch_upload(records))
# Approach 2: synchronous read, synchronous upload.
import json

with open('records.txt', 'r') as record_file:
    for line in record_file:
        record = json.loads(line)
        if valid(record):
            sync_upload(record)
# Approach 3: asynchronous read with aiofiles, awaiting each upload as it is found.
import json
import asyncio
import aiohttp
import aiofiles

async def batch_upload():
    async with aiohttp.ClientSession() as session:
        async with aiofiles.open('records.txt', 'r') as record_file:
            line = await record_file.readline()
            while line:
                record = json.loads(line)
                if valid(record):
                    await async_upload(record, session)
                line = await record_file.readline()

asyncio.run(batch_upload())
The file I'm developing this with is about 1.3 GB, with 100,000 records total, 691 of which I upload. Each upload begins with a GET request to see if the record already exists in CouchDB. If it does, a PUT is performed to update the CouchDB record with any new information; if it doesn't, the record is POSTed to the db. So each upload consists of two API requests. For dev purposes I'm only creating records, so I run the GET and POST requests, 1382 API calls total.
Approach #1 takes about 17 seconds, approach #2 takes about 33 seconds, and approach #3 takes about 42 seconds.
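The `async_upload` coroutine referenced in the pseudo-code above isn't shown; a minimal sketch of the GET-then-PUT/POST flow it would implement might look like this. The `DB_URL` value and the use of the record's `_id` field are assumptions for illustration, and `session` is expected to behave like an `aiohttp.ClientSession`:

```python
# Hypothetical sketch of async_upload: the real implementation isn't shown
# in the question. DB_URL is an assumed CouchDB database endpoint.
DB_URL = 'http://localhost:5984/mydb'

async def async_upload(record, session):
    doc_url = f"{DB_URL}/{record['_id']}"
    # First request: GET to check whether the record already exists.
    async with session.get(doc_url) as resp:
        exists = resp.status == 200
        body = await resp.json() if exists else None
    if exists:
        # Second request: PUT to update the existing document
        # (CouchDB requires the current _rev for an update).
        record['_rev'] = body['_rev']
        async with session.put(doc_url, json=record) as resp:
            return await resp.json()
    else:
        # Second request: POST to create a new document.
        async with session.post(DB_URL, json=record) as resp:
            return await resp.json()
```

Either branch issues exactly two API requests per record, matching the 2 × 691 = 1382 calls described above.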
Your code uses async, but it does the work sequentially, and in that case it will be slower than the sync approach. Async won't speed up execution if it isn't constructed/used effectively.
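Awaiting each upload before reading the next line serializes the waits, while gathering them lets the waits overlap. A minimal, self-contained sketch using `asyncio.sleep` as a stand-in for a network call:

```python
import asyncio
import time

async def fake_upload(record):
    # Stand-in for a network call: each upload "waits" 50 ms.
    await asyncio.sleep(0.05)
    return record

async def sequential(records):
    # Approach #3's shape: await each upload before moving on.
    return [await fake_upload(r) for r in records]

async def concurrent(records):
    # Approach #1's shape: start every upload, then gather them all.
    return await asyncio.gather(*(fake_upload(r) for r in records))

records = list(range(10))

start = time.perf_counter()
asyncio.run(sequential(records))
t_seq = time.perf_counter() - start   # roughly ten 50 ms waits back to back

start = time.perf_counter()
asyncio.run(concurrent(records))
t_conc = time.perf_counter() - start  # roughly one 50 ms wait, shared
```

The sequential version pays each wait in full, one after another; the gathered version pays them once, together.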
You can create two coroutines and have them run concurrently; perhaps that speeds up the operation. Example:
#!/usr/bin/env python3
import asyncio

async def upload(event, queue):
    # Keep consuming until shutdown is signalled *and* the queue is drained.
    # (This shutdown logic is still not fully robust -- a sentinel value on
    # the queue is the safer pattern -- but it gives the idea.)
    while not (event.is_set() and queue.empty()):
        record = await queue.get()
        print(f'uploading record : {record}')

async def read(event, queue):
    # Dummy logic: instead, read the file here and populate the queue.
    for i in range(1, 10):
        await queue.put(i)
    # Initiate shutdown.
    event.set()

async def main():
    event = asyncio.Event()
    queue = asyncio.Queue()
    uploader = asyncio.create_task(upload(event, queue))
    reader = asyncio.create_task(read(event, queue))
    await asyncio.gather(uploader, reader)

if __name__ == '__main__':
    asyncio.run(main())
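Applied to the original problem, the reader coroutine would parse and filter lines while the uploader consumes records from the queue and issues the API calls. A sketch of that shape, where `valid` and `do_upload` are stand-ins for the question's filter and API calls, and a `None` sentinel replaces the event for a simpler shutdown:

```python
import asyncio

def valid(record):
    # Stand-in for the question's filter.
    return record % 2 == 0

uploaded = []

async def do_upload(record):
    # Stand-in for the two CouchDB API calls.
    await asyncio.sleep(0)
    uploaded.append(record)

async def reader(queue):
    # Stand-in for reading the file line by line:
    # enqueue only the records that pass the filter.
    for record in range(10):
        if valid(record):
            await queue.put(record)
    await queue.put(None)  # sentinel: no more records

async def uploader(queue):
    while True:
        record = await queue.get()
        if record is None:  # sentinel received, shut down cleanly
            break
        await do_upload(record)

async def main():
    queue = asyncio.Queue()
    await asyncio.gather(reader(queue), uploader(queue))

asyncio.run(main())
```

With this shape, file reading and uploading overlap instead of alternating, which is where the speedup over approach #3 would come from.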