
How to make a huge loop for checking condition in dataframe run faster in mac

I need to calculate a huge table of values (157,954 rows and 365 columns) by checking three conditions against a dataframe with 11 million rows. Is there any way to speed up the calculation, which currently takes more than 10 hours?

I have 367 stations in total.

for station in stations:
    no_pickup_array = []
    for time_point in data_matrix['Timestamp']:
        time_point_2 = time_point + timedelta(minutes=15)
        no_pickup = len(dataframe[(time_point <= dataframe["departure"]) & (dataframe["departure"] < time_point_2) 
                        & (dataframe['departure_name'] == station)])
        no_pickup_array.append(no_pickup)
    print(f"Station name: {station}")
    data_matrix[station] = no_pickup_array

I appreciate any help.

@ To all: Thank you for your comments. I have added more information about my problem.

This is the dataframe used for checking the conditions.

Each row of the dataframe holds the information for one bike rental. I want to create a matrix with the number of bikes picked up at each station in each 15-minute interval. Then I also want to calculate the average speed, average time, and so on.

The solution from @Jérôme Richard could reduce the number of calculations, but I still struggle to understand and implement the indexing steps and to apply the logarithmic (binary) search.

index = {name: df.sort_values('departure')['departure'].to_numpy() for name, df in dataframe.groupby('departure_name')}
# code @Jérôme Richard recommended

The main problem is the right-hand side of the no_pickup assignment, which is algorithmically inefficient: it performs a linear search over the whole dataframe for every station and time point, while a logarithmic search is possible.

The first thing to do is a groupby on dataframe, so as to build an index that lets you fetch the subset of rows with a given station name. Then you can sort each subset by departure, which makes it possible to run a binary search and find the number of items matching the condition.

The index can be built with something like:

index = {name: df.sort_values('departure')['departure'].to_numpy() for name, df in dataframe.groupby('departure_name')}

Finally, you can do the binary search with two calls to np.searchsorted on index[station]: one to find the starting index and one to find the ending index. Subtracting the first from the second gives the number of matching rows.

Note that you may need some tweaks, since I am not sure the above code will work on your dataset; it is hard to know without example code that generates the inputs.
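The steps above can be sketched end to end as follows. This is only a minimal illustration, not tested against the real data: the five-trip dataframe and the three-row data_matrix are hypothetical stand-ins, and the names starts, ends, lo and hi are introduced here for the example. It assumes departure and Timestamp are both datetime columns.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the 11-million-row trip table.
dataframe = pd.DataFrame({
    "departure": pd.to_datetime([
        "2021-01-01 00:03", "2021-01-01 00:10", "2021-01-01 00:20",
        "2021-01-01 00:14", "2021-01-01 00:31",
    ]),
    "departure_name": ["A", "A", "A", "B", "B"],
})

# Hypothetical 15-minute grid standing in for data_matrix['Timestamp'].
data_matrix = pd.DataFrame({
    "Timestamp": pd.date_range("2021-01-01 00:00", periods=3, freq="15min"),
})

# One sorted NumPy array of departure times per station.
index = {
    name: group["departure"].sort_values().to_numpy()
    for name, group in dataframe.groupby("departure_name")
}

# Interval bounds for all time points at once: [start, start + 15 min).
starts = data_matrix["Timestamp"].to_numpy()
ends = starts + np.timedelta64(15, "m")

for station, departures in index.items():
    # Position of the first departure >= start of each interval.
    lo = np.searchsorted(departures, starts, side="left")
    # Position of the first departure >= end of each interval.
    hi = np.searchsorted(departures, ends, side="left")
    # Departures with start <= departure < end.
    data_matrix[station] = hi - lo

print(data_matrix)
```

Because np.searchsorted accepts an array of values to look up, the inner loop over time points disappears entirely; only the loop over stations remains. Stations that have no trips at all do not appear in index, so they would need a column of zeros added separately.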

You're indexing the dataframe list with a boolean (which will be zero or one, so you're only ever going to get the length of the first or second element) instead of a number. It's going to get evaluated like so:

len(dataframe[(time_point <= dataframe["departure"]) & (dataframe["departure"] < time_point_2) & (dataframe['departure_name'] == station)])
len(dataframe[True & False & True]) # let's just say the variables work out like this
len(dataframe[False])
len(dataframe[0])

This probably isn't the behavior you're after. (Let me know what you're trying to do in a comment and I'll try to help out more.)

In terms of code speed specifically, & is the bitwise "AND" operator; in Python the boolean operators are spelled out as the keywords and, or, and not. Using and here would speed up your code, since Python only evaluates the parts of a boolean expression that it needs, e.g.

from time import sleep
def slow_function():
    sleep(3)
    return False

# This line doesn't take 3 seconds to run as you may expect.
# Python sees "False and" and is smart enough to realize that whatever comes after is irrelevant. 
# No matter what comes after "False and", it's never going to make the first half True.
# So, python doesn't bother evaluating it, and saves 3 seconds in the process.
False and slow_function()

# Some more examples that show python doesn't evaluate the right half unless it needs to
False and print("hi")
False and asdfasdfasdf
False and 42/0

# The same does not happen here. These are bitwise operators, expected to be applied to numbers.
# While it does produce the correct result for boolean inputs, it's going to be slower,
# since it can't rely on the same optimization.
False & slow_function()

# Of course, both of these still take 3 seconds, since the right half has to be evaluated either way.
True and slow_function()
True & slow_function()
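One caveat worth checking: the short-circuit behavior above applies to plain Python booleans. Below is a minimal sketch of how the same operators behave when the operands are pandas Series, as they appear to be in the question's filter. The three-row sample is hypothetical, and and_raised is a name introduced only for this example.

```python
import pandas as pd

# Hypothetical three-row sample; column names follow the question.
dataframe = pd.DataFrame({
    "departure": pd.to_datetime([
        "2021-01-01 00:03", "2021-01-01 00:10", "2021-01-01 00:20",
    ]),
    "departure_name": ["A", "B", "A"],
})
time_point = pd.Timestamp("2021-01-01 00:00")
time_point_2 = time_point + pd.Timedelta(minutes=15)

# Each comparison yields a boolean Series; & combines them element by element.
mask = (
    (time_point <= dataframe["departure"])
    & (dataframe["departure"] < time_point_2)
    & (dataframe["departure_name"] == "A")
)

# Indexing with a boolean Series keeps the rows where the mask is True.
no_pickup = len(dataframe[mask])

# `and` needs a single truth value, which a multi-element Series does not have.
try:
    (dataframe["departure_name"] == "A") and (dataframe["departure"] < time_point_2)
    and_raised = False
except ValueError:
    and_raised = True

print(no_pickup, and_raised)
```

So if dataframe is a pandas DataFrame, the & form is a row filter and swapping in and is not a drop-in replacement; the short-circuit saving only applies when both sides are ordinary scalar booleans.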
