How to make a huge loop that checks conditions in a dataframe run faster on a Mac
I need to calculate a huge table (157,954 rows and 365 columns) by checking three conditions in a dataframe with 11 million rows. Is there any way to speed up the calculation, which currently takes more than 10 hours?
I have 367 stations in total.
for station in stations:
    no_pickup_array = []
    for time_point in data_matrix['Timestamp']:
        time_point_2 = time_point + timedelta(minutes=15)
        no_pickup = len(dataframe[(time_point <= dataframe["departure"]) & (dataframe["departure"] < time_point_2)
                                  & (dataframe['departure_name'] == station)])
        no_pickup_array.append(no_pickup)
    print(f"Station name: {station}")
    data_matrix[station] = no_pickup_array
I appreciate any help.
@ To all: Thank you for your comments. I have added more information about my problem.
Each row of dataframe describes one bike rental. I want to create a matrix with the number of bikes picked up at each station for each 15-minute interval. After that I also want to calculate the average speed, average time, and so on.
The solution from @Jérôme Richard could reduce the number of calculations, but I still struggle to understand and implement the indexing step and to apply the logarithmic (binary) search.
index = {name: df.sort_values('departure')['departure'].to_numpy() for name, df in dataframe.groupby('departure_name')}
# code @Jérôme Richard recommended
The main problem is the right-hand side of the no_pickup assignment, which is algorithmically inefficient: it performs a linear search where a logarithmic search is possible.
The first thing to do is a groupby of dataframe, to build an index that fetches the dataframe subset for a given name. Then you can sort each subset by departure, so that a binary search can tell you the number of items matching the condition.
The index can be built with something like:
index = {name: df.sort_values('departure')['departure'].to_numpy() for name, df in dataframe.groupby('departure_name')}
Finally, you can do the binary search with two np.searchsorted calls on index[station]: one to find the starting index and one to find the ending index. Subtracting the two gives the length.
Note that you may need some tweaks: I am not sure the above code will work on your dataset, and it is hard to know without example code that generates the inputs.
You're indexing the dataframe list with a boolean (which will be zero or one, so you're only ever going to get the length of the first or second element) instead of a number. It's going to get evaluated like so:
len(dataframe[(time_point <= dataframe["departure"]) & (dataframe["departure"] < time_point_2) & (dataframe['departure_name'] == station)])
len(dataframe[True & False & True]) # let's just say the variables work out like this
len(dataframe[False])
len(dataframe[0])
This probably isn't the behavior you're after. (Let me know in a comment what you're trying to do and I'll try to help out more.)
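Whether the expression collapses to a single boolean depends on the type of dataframe. If it is a pandas DataFrame, as the question's wording suggests, each comparison produces a boolean Series with one entry per row, and indexing with it selects the matching rows. The sketch below uses a made-up three-row table (the values are assumptions, not the asker's data) to check which case applies.

```python
import pandas as pd

# Hypothetical miniature of the question's data.
dataframe = pd.DataFrame({
    "departure": pd.to_datetime(
        ["2021-01-01 00:05", "2021-01-01 00:20", "2021-01-01 00:10"]
    ),
    "departure_name": ["A", "A", "B"],
})
time_point = pd.Timestamp("2021-01-01 00:00")
time_point_2 = time_point + pd.Timedelta(minutes=15)

mask = (
    (time_point <= dataframe["departure"])
    & (dataframe["departure"] < time_point_2)
    & (dataframe["departure_name"] == "A")
)

# mask is a boolean Series with one entry per row, not a single True/False.
mask_values = mask.tolist()

# Indexing with the mask keeps only the matching rows.
no_pickup = len(dataframe[mask])

# Summing the mask gives the same count without building the filtered frame.
no_pickup_fast = int(mask.sum())
```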
In terms of code speed specifically, & is the bitwise "AND"; in Python the boolean operators are written out as and, or, and not. Using and here would speed up your code, since Python only evaluates the parts of a boolean expression that are needed, e.g.:
from time import sleep
def slow_function():
    sleep(3)
    return False
# This line doesn't take 3 seconds to run as you may expect.
# Python sees "False and" and is smart enough to realize that whatever comes after is irrelevant.
# No matter what comes after "False and", it's never going to make the first half True.
# So, python doesn't bother evaluating it, and saves 3 seconds in the process.
False and slow_function()
# Some more examples that show python doesn't evaluate the right half unless it needs to
False and print("hi")
False and asdfasdfasdf
False and 42/0
# The same does not happen here. These are bitwise operators, expected to be applied to numbers.
# While it does produce the correct result for boolean inputs, it's going to be slower,
# since it can't rely on the same optimization.
False & slow_function()
# Of course, both of these still take 3 seconds, since the right half has to be evaluated either way.
True and slow_function()
True & slow_function()
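One thing worth checking before swapping & for and in the question's expression: the short-circuit behaviour applies to plain Python booleans. If the operands are pandas Series, which is an assumption about the asker's data, & works element by element, while and asks for a single truth value of the whole Series and raises an error. A minimal sketch:

```python
import pandas as pd

a = pd.Series([True, False, True])
b = pd.Series([True, True, False])

# Element-wise AND: one result per row.
elementwise = (a & b).tolist()

# "and" needs bool(a), which pandas refuses for a multi-element Series.
try:
    a and b
    and_error = None
except ValueError as exc:
    and_error = type(exc).__name__
```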