How to improve the computation speed of subsetting a pandas dataframe?

I have a large df (14 × 1,000,000) and I want to subset it. Unsurprisingly, the calculation takes a long time, and I wonder how to improve the speed.

What I want is, for each Name, the row with the lowest value of Total_time, ignoring zero values and picking only the first row if more than one row has the lowest value of Total_time. These rows should then all be appended to a new dataframe, unique.

Is there a general mistake in my code that makes it inefficient?

unique = pd.DataFrame([])
i=0
for pair in df['Name'].unique():
    i=i+1
    temp =df[df["Name"]== pair]
    temp2 = temp.loc[df['Total_time']  != 0]
    lowest = temp2['Total_time'].min()
    temp3 = temp2[temp2["Total_time"] == lowest].head(1)
    unique = unique.append(temp3)
    print("finished "+ pair + " "+ str(i))

In general, you don't want to iterate over each item.

If you want the smallest time for each Name:

new_df = df[df["Total_time"] != 0].copy() # you seem to be throwing away 0
out = new_df.groupby("Name")["Total_time"].min()

If you need the rest of the columns:

new_df.loc[new_df.groupby("Name")["Total_time"].idxmin()]
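
To make the two steps above concrete, here is a minimal end-to-end sketch on a small made-up frame. The sample values and the extra column Other are assumptions for illustration; only Name and Total_time come from the question.

```python
import pandas as pd

# Hypothetical sample data; only Name and Total_time appear in the question.
df = pd.DataFrame({
    "Name": ["a", "a", "a", "b", "b", "b"],
    "Total_time": [0.0, 2.0, 2.0, 3.0, 1.0, 0.0],
    "Other": [10, 11, 12, 13, 14, 15],
})

# Drop zero times first so they cannot win the minimum.
new_df = df[df["Total_time"] != 0]

# For each Name, idxmin gives the index label of the first minimal row.
first_min_labels = new_df.groupby("Name")["Total_time"].idxmin()

# Select the full rows by label; ties resolve to the earliest row.
unique = new_df.loc[first_min_labels]
print(unique)
```

This replaces the per-Name loop with a single grouped pass. It also avoids DataFrame.append, which the question's loop relies on and which was removed in pandas 2.0.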

What I want is to subset for each Name the lowest value of Total_time while ignoring zero values and picking only the first one if there is more than one row has the lowest value of Total_time.

This sounds like a task for pandas.Series.idxmin. Consider the following simple example:

import pandas as pd
df = pd.DataFrame({"X":["A","B","C","D","E"],"Y":[5.5,0.0,5.5,1.5,1.5]})
first_min = df.Y.replace(0,float("nan")).idxmin()
print(df.loc[first_min])  # idxmin returns an index label, so select with .loc

Output:

X      D
Y    1.5
Name: 3, dtype: object

Explanation: replace 0 with NaN so that zeros are not considered, then use idxmin to get the index label of the first minimum. That label can be used to select the row with .loc; with the default RangeIndex used here, .iloc selects the same row.
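
The example above finds one minimum for the whole frame, whereas the question asks for one row per Name. A possible way to extend the same idea is sketched below; the sample data is made up, and grouping the masked Series by the Name column is one option among several.

```python
import pandas as pd

# Hypothetical data using the question's column names.
df = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B"],
    "Total_time": [5.5, 0.0, 1.5, 1.5, 1.5],
})

# Mask zeros as NaN so idxmin skips them (NaN is skipped by default).
masked = df["Total_time"].replace(0, float("nan"))

# Group the masked values by Name and take the label of each first minimum.
labels = masked.groupby(df["Name"]).idxmin()

# Look up the full rows by label.
result = df.loc[labels]
print(result)
```

One caveat: a Name whose times are all zero becomes an all-NaN group, and how idxmin handles that case has differed between pandas versions, so filtering the zero rows out beforehand may be the safer route.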
