How to make a huge loop that checks conditions in a dataframe run faster on Mac

Question

I need to calculate a huge table of values (157954 rows and 365 columns) by checking three conditions in a dataframe with 11 million rows. Do you have any way to speed up the calculation, which currently takes more than 10 hours?

I have 367 stations in total.

for station in stations:
    no_pickup_array = []
    for time_point in data_matrix['Timestamp']:
        time_point_2 = time_point + timedelta(minutes=15)
        no_pickup = len(dataframe[(time_point <= dataframe["departure"]) & (dataframe["departure"] < time_point_2) 
                        & (dataframe['departure_name'] == station)])
        no_pickup_array.append(no_pickup)
    print(f"Station name: {station}")
    data_matrix[station] = no_pickup_array

I appreciate any help.

@ To all: Thank you for your comments. I have added more information about my problem.

Each row of dataframe holds the information for one bike rental. I want to create a matrix with the number of bikes picked up at each station in each 15-minute interval. Later I also want to calculate the average speed, average time, and so on.

The solution from @Jérôme Richard could reduce the number of calculations, but I still struggle to understand and implement the indexing steps and to apply the logarithmic (binary) search.

index = {name: df.sort_values('departure')['departure'].to_numpy() for name, df in dataframe.groupby('departure_name')}
# code @Jérôme Richard recommended

Answer 1

The main problem is the right-hand side of the no_pickup assignment, which is algorithmically inefficient: it performs a linear search over the whole dataframe for every station and time point, while a logarithmic search is possible.

The first thing to do is a groupby on dataframe to build an index that lets you fetch the subset of rows with a given departure_name. Then sort each subset by departure so that a binary search can tell you how many items fit the condition.

The index can be built with something like:

index = {name: df.sort_values('departure')['departure'].to_numpy() for name, df in dataframe.groupby('departure_name')}

Finally, you can do the binary search with two np.searchsorted calls on index[station]: one to find the starting index and one to find the ending index. Subtracting the two gives the count.
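For a single station, the counting step might look like the sketch below. The departure times are made-up toy values and the variable names are only illustrative; the half-open window [start, end) mirrors the `<=` / `<` conditions in the question.

```python
import numpy as np

# Hypothetical, already sorted departure times for one station (toy data).
departures = np.array(
    ["2022-06-01T08:01", "2022-06-01T08:07", "2022-06-01T08:15",
     "2022-06-01T08:29", "2022-06-01T08:31"],
    dtype="datetime64[m]",
)

start = np.datetime64("2022-06-01T08:00", "m")
end = start + np.timedelta64(15, "m")

# side="left" returns the first position whose value is >= the probe,
# so hi - lo counts the departures with start <= departure < end.
lo = np.searchsorted(departures, start, side="left")
hi = np.searchsorted(departures, end, side="left")
no_pickup = int(hi - lo)
```

Each lookup costs O(log n) on the sorted array instead of a scan over every row.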

Note that you may need some tweaks, since I am not sure the above code will work on your dataset; it is hard to know without example code that generates the inputs.
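Putting the steps together, a minimal end-to-end sketch could look like this. It assumes the column names from the question (departure, departure_name, Timestamp) and uses a tiny invented dataset; passing the whole array of window starts to np.searchsorted at once also removes the inner Python loop over time points.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the rentals table (invented values).
dataframe = pd.DataFrame({
    "departure": pd.to_datetime([
        "2022-06-01 08:01", "2022-06-01 08:07", "2022-06-01 08:15",
        "2022-06-01 08:20", "2022-06-01 08:44",
    ]),
    "departure_name": ["A", "B", "A", "A", "B"],
})

# One row per 15-minute window start, like data_matrix in the question.
data_matrix = pd.DataFrame({
    "Timestamp": pd.date_range("2022-06-01 08:00", periods=3, freq="15min"),
})

# Sorted departure times per station.
index = {
    name: df["departure"].sort_values().to_numpy()
    for name, df in dataframe.groupby("departure_name")
}

starts = data_matrix["Timestamp"].to_numpy()
ends = starts + np.timedelta64(15, "m")

for station, departures in index.items():
    # Vectorized binary search: one call handles every window at once.
    lo = np.searchsorted(departures, starts, side="left")
    hi = np.searchsorted(departures, ends, side="left")
    data_matrix[station] = hi - lo
```

A station that never appears in departure_name gets no key in index, so it would need to be filled with zeros separately if the matrix must contain every station.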

Answer 2

You appear to be indexing dataframe with a boolean (which will be zero or one, so you would only ever get the length of the first or second element) instead of a number. If the three conditions evaluate to plain Python booleans, the expression is evaluated like so (with pandas Series the comparisons produce a boolean mask instead, and the expression counts the matching rows):

len(dataframe[(time_point <= dataframe["departure"]) & (dataframe["departure"] < time_point_2) & (dataframe['departure_name'] == station)])
len(dataframe[True & False & True]) # let's just say the variables work out like this
len(dataframe[False])
len(dataframe[0])

This probably isn't the behavior you're after. (Let me know in a comment what you're trying to do and I'll try to help out more.)

In terms of code speed specifically, & is bitwise "AND"; in Python the boolean operators are written out as and, or, and not. With plain booleans, using and can speed up code, since Python only evaluates the parts of a boolean expression that are needed (short-circuit evaluation). This applies to scalar booleans only: on pandas Series, & is the elementwise operator and and raises a ValueError. For example:

from time import sleep
def slow_function():
    sleep(3)
    return False

# This line doesn't take 3 seconds to run as you may expect.
# Python sees "False and" and is smart enough to realize that whatever comes after is irrelevant. 
# No matter what comes after "False and", it's never going to make the first half True.
# So, python doesn't bother evaluating it, and saves 3 seconds in the process.
False and slow_function()

# Some more examples that show python doesn't evaluate the right half unless it needs to
False and print("hi")
False and asdfasdfasdf
False and 42/0

# The same does not happen here. These are bitwise operators, expected to be applied to numbers.
# While it does produce the correct result for boolean inputs, it's going to be slower,
# since it can't rely on the same optimization.
False & slow_function()

# Of course, both of these still take 3 seconds, since the right half has to be evaluated either way.
True and slow_function()
True & slow_function()
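
For comparison, here is how the same operators behave when dataframe is a pandas DataFrame, as in the question. This is a small sketch on invented toy data: each comparison returns a boolean Series, & combines them element by element into a row mask, and and fails because a multi-element Series has no single truth value.

```python
import pandas as pd

# Toy data (invented values); column names follow the question.
dataframe = pd.DataFrame({
    "departure": pd.to_datetime(
        ["2022-06-01 08:01", "2022-06-01 08:20", "2022-06-01 08:05"]
    ),
    "departure_name": ["A", "A", "B"],
})
time_point = pd.Timestamp("2022-06-01 08:00")
time_point_2 = time_point + pd.Timedelta(minutes=15)

# Each comparison yields a boolean Series; & combines them row by row.
mask = (
    (time_point <= dataframe["departure"])
    & (dataframe["departure"] < time_point_2)
    & (dataframe["departure_name"] == "A")
)
no_pickup = len(dataframe[mask])  # number of rows where the mask is True

# `and` asks for one truth value, which a multi-element Series cannot give.
try:
    (time_point <= dataframe["departure"]) and (dataframe["departure"] < time_point_2)
    and_raised = False
except ValueError:
    and_raised = True
```

So with Series operands there is no short-circuiting to gain; the speedup has to come from doing fewer or cheaper comparisons.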

Answer 1
1 2022-06-16 23:55:11

Answer 2
1 2022-06-17 07:46:38
