
Faster way to iterate over dataframe?

I have a dataframe where each row is a record, and I need to send each record in the body of a POST request. Right now I am looping over the dataframe to accomplish this. I am constrained by the fact that each record must be posted individually. Is there a faster way to accomplish this?

Iterating over the dataframe is not the issue here. The issue is that you have to wait for the server to respond to each of your requests. A network request takes eons compared to the CPU time needed to iterate over the dataframe. In other words, your program is I/O bound, not CPU bound.
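For reference, a sequential version of what you describe would look something like this (a hypothetical sketch, assuming a pandas DataFrame and the requests library; the DataFrame contents and endpoint are placeholders):

import pandas as pd
import requests

# Placeholder data standing in for your real dataframe.
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})

for record in df.to_dict(orient="records"):
    # The program sits idle here on every iteration, waiting for the network.
    response = requests.post("https://httpbin.org/post", json=record)
    response.raise_for_status()

Almost all of the runtime of this loop is spent waiting for responses, not iterating.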

One way to speed it up is to use coroutines. Let's say you have to make 1000 requests. Instead of firing one request, waiting for the response, then firing the next request and so on, you fire all 1000 requests at once and tell Python to wait until you have received all 1000 responses.

Since you didn't provide any code, here's a small program to illustrate the point:

import aiohttp
import asyncio
import numpy as np
import time

from typing import List

async def send_single_request(session: aiohttp.ClientSession, url: str):
    async with session.get(url) as response:
        return await response.json()

async def send_all_requests(urls: List[str]):
    async with aiohttp.ClientSession() as session:
        # Make 1 coroutine for each request
        coroutines = [send_single_request(session, url) for url in urls]
        # Wait until all coroutines have finished
        return await asyncio.gather(*coroutines)

# We will make 10 requests to httpbin.org. Each request will take at least d
# seconds. If you were to fire them sequentially, they would have taken at least
# delays.sum() seconds to complete.
np.random.seed(42)
delays = np.random.randint(0, 5, 10)
urls = [f"https://httpbin.org/delay/{d}" for d in delays]

# Instead, we will fire all 10 requests at once, then wait until all 10 have
# finished.
t1 = time.time()
result = asyncio.run(send_all_requests(urls))
t2 = time.time()

print(f"Expected time: {delays.sum()} seconds")
print(f"Actual time: {t2 - t1:.2f} seconds")

Output:

Expected time: 28 seconds
Actual time: 4.57 seconds

You will have to read up a bit on coroutines and how they work, but for the most part they are not too complicated for your use case. This comes with a couple of caveats:

  1. All your requests must be independent of each other.
  2. The rate limit on the server must be sufficient to handle your workload. For example, if it restricts you to 2 requests per minute, there is no way around that other than upgrading to a different service tier.
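To tie this back to your dataframe, the same pattern works with POST requests. Here is a sketch assuming your rows can be converted to dicts with DataFrame.to_dict and that the endpoint accepts JSON; the data and URL are placeholders (httpbin.org/post simply echoes the body back):

import aiohttp
import asyncio
import pandas as pd

from typing import List

async def post_single_record(session: aiohttp.ClientSession, url: str, record: dict):
    async with session.post(url, json=record) as response:
        return await response.json()

async def post_all_records(url: str, records: List[dict]):
    async with aiohttp.ClientSession() as session:
        # One coroutine per record, fired concurrently
        coroutines = [post_single_record(session, url, record) for record in records]
        return await asyncio.gather(*coroutines)

# Placeholder dataframe; in your case this is the dataframe you already have.
df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
# Each row becomes one dict, i.e. one request body.
records = df.to_dict(orient="records")

results = asyncio.run(post_all_records("https://httpbin.org/post", records))
print(len(results), "responses received")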
