Pythonic Solution for Improving Runtime Efficiency
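I want to improve the runtime of a Python program that takes a Pandas DataFrame and creates two new variables (group and groupdate) based on several conditions (code and logic below). The code works fine on small datasets, but on a large dataset (20 million rows) it takes more than 7 hours to run.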
import pandas as pd
import numpy as np

ID = ['a1','a1','a1','a1','a1','a2','a2','a2','a2','a2']
DATE = ['1/1/2014','1/15/2014','1/20/2014','1/22/2014','3/10/2015', \
        '1/13/2015','1/20/2015','1/28/2015','2/28/2015','3/20/2015']
ITEM = ['P1','P2','P3','P4','P5','P1','P2','P3','P4','P5']

df = pd.DataFrame({"ID": ID, "DATE": DATE, "ITEM": ITEM})
df['DATE']= pd.to_datetime(df['DATE'], format = '%m/%d/%Y')

ids=df.ID
df['first_id'] = np.where((ids!=ids.shift(1)), 1, 0)
df['last_id'] = np.where((ids!=ids.shift(-1)), 1, 0)
print(df); print('\n')

for i in range(0,len(df)):
    if df.loc[i,'first_id']==1:
        df.loc[i,'group'] = 1
        df.loc[i,'groupdate'] = df.loc[i,'DATE']
    elif df.loc[i,'first_id']==0 and ((df.loc[i,'DATE'] - df.loc[i-1,'DATE']).days > 10) or \
         ((df.loc[i,'DATE'] - df.loc[i-1,'groupdate']).days > 10):
        df.loc[i,'group'] = df.loc[i-1,'group'] + 1
        df.loc[i,'groupdate'] = df.loc[i,'DATE']
    else:
        if df.loc[i,'first_id']==0 and ((df.loc[i,'DATE'] - df.loc[i-1,'DATE']).days <= 10) or \
           ((df.loc[i,'DATE'] - df.loc[i-1,'groupdate']).days <= 10):
            df.loc[i,'group'] = df.loc[i-1,'group']
            df.loc[i,'groupdate'] = df.loc[i-1,'groupdate']

print(df); print('\n')
Desired output:

ID  DATE       ITEM  GROUP  GROUPDATE
1   1/1/2014   P1    1      1/1/2014
1   1/15/2014  P2    2      1/15/2014
1   1/20/2014  P3    2      1/15/2014
1   1/22/2014  P4    2      1/15/2014
1   3/10/2015  P5    3      3/10/2015
2   1/13/2015  P1    1      1/13/2015
2   1/20/2015  P2    1      1/13/2015
2   1/28/2015  P3    2      1/28/2015
2   2/28/2015  P4    3      2/28/2015
2   3/20/2015  P5    4      3/20/2015
Please don't take this as a complete answer, but rather as a work in progress and a starting point.
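The code below processes each ID separately with groupby. It uses simplified conditions and does not implement the logic of previous_groupdate.

import pandas as pd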
import numpy as np
ID = ['a1','a1','a1','a1','a1','a2','a2','a2','a2','a2']
DATE = ['1/1/2014','1/15/2014','1/20/2014','1/22/2014','3/10/2015', \
'1/13/2015','1/20/2015','1/28/2015','2/28/2015','3/20/2015']
ITEM = ['P1','P2','P3','P4','P5','P1','P2','P3','P4','P5']
df = pd.DataFrame({"ID": ID, "DATE": DATE, "ITEM": ITEM})
df['DATE']= pd.to_datetime(df['DATE'], format = '%m/%d/%Y')
ids=df.ID
df['first_id'] = np.where((ids!=ids.shift(1)), 1, 0)

def fun(x):
    # Work on a copy so the group passed in by groupby is not modified
    x = x.copy()
    # To compare with the previous date I add a column
    x["PREVIOUS_DATE"] = x["DATE"].shift(1)
    x["DATE_DIFF1"] = (x["DATE"]-x["PREVIOUS_DATE"]).dt.days
    # These are your simplified conditions
    conds = [x["first_id"]==1,
             ((x["first_id"]==0) & (x["DATE_DIFF1"]>10)),
             ((x["first_id"]==0) & (x["DATE_DIFF1"]<=10))]
    # choices for date
    choices_date = [x["DATE"].astype(str),
                    x["DATE"].astype(str),
                    '']
    # choices for group
    # To get the expected output we'll need a cumsum
    choices_group = [ 1, 1, 0]
    # I use np.select, you can check how it works
    x["group_date"] = np.select(conds, choices_date, default="")
    x["group"] = np.select(conds, choices_group, default=0)
    # Some group_date values are empty: parse them as NaT, then forward-fill
    x["group_date"] = pd.to_datetime(x["group_date"]).ffill()
    # Here is the cumsum
    x["group"] = x["group"].cumsum()
    # Remove columns we don't need
    x = x.drop(["first_id", "PREVIOUS_DATE", "DATE_DIFF1"], axis=1)
    return x

# group_keys=False keeps the original row index instead of adding ID to it
df = df.groupby("ID", group_keys=False).apply(fun)
Output:

   ID       DATE ITEM group_date  group
0  a1 2014-01-01   P1 2014-01-01      1
1  a1 2014-01-15   P2 2014-01-15      2
2  a1 2014-01-20   P3 2014-01-15      2
3  a1 2014-01-22   P4 2014-01-15      2
4  a1 2015-03-10   P5 2015-03-10      3
5  a2 2015-01-13   P1 2015-01-13      1
6  a2 2015-01-20   P2 2015-01-13      1
7  a2 2015-01-28   P3 2015-01-13      1
8  a2 2015-02-28   P4 2015-02-28      2
9  a2 2015-03-20   P5 2015-03-20      3
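The rule in the question also compares each row with the date that opened the current group, so every row depends on the one before it. One possible way to keep that rule and still avoid row-by-row df.loc access is to run the same loop over plain NumPy arrays. The following is only a sketch, not a benchmarked solution: the helper name assign_groups and its max_gap_days parameter are made up for this example, and it assumes the frame is already sorted by ID and DATE and that DATE carries no time-of-day component.

```python
import numpy as np
import pandas as pd


def assign_groups(df, max_gap_days=10):
    """Hypothetical helper: apply the question's rule in one pass over NumPy arrays.

    Assumes df is sorted by ID, then DATE, and that DATE is a datetime64 column
    holding whole days.
    """
    ids = df["ID"].to_numpy()
    # Whole days since the epoch, so differences are plain integer day counts.
    days = df["DATE"].to_numpy().astype("datetime64[D]").astype(np.int64)

    n = len(df)
    group = np.empty(n, dtype=np.int64)
    anchor = np.empty(n, dtype=np.int64)  # day number of each row's group date

    for i in range(n):
        if i == 0 or ids[i] != ids[i - 1]:
            # First row of an ID opens group 1.
            group[i] = 1
            anchor[i] = days[i]
        elif (days[i] - days[i - 1] > max_gap_days
              or days[i] - anchor[i - 1] > max_gap_days):
            # Too far from the previous row or from the current group date.
            group[i] = group[i - 1] + 1
            anchor[i] = days[i]
        else:
            # Still inside the current group.
            group[i] = group[i - 1]
            anchor[i] = anchor[i - 1]

    out = df.copy()
    out["group"] = group
    out["groupdate"] = anchor.astype("datetime64[D]").astype("datetime64[ns]")
    return out


# Sample data from the question
sample = pd.DataFrame({
    "ID": ['a1','a1','a1','a1','a1','a2','a2','a2','a2','a2'],
    "DATE": ['1/1/2014','1/15/2014','1/20/2014','1/22/2014','3/10/2015',
             '1/13/2015','1/20/2015','1/28/2015','2/28/2015','3/20/2015'],
    "ITEM": ['P1','P2','P3','P4','P5','P1','P2','P3','P4','P5'],
})
sample["DATE"] = pd.to_datetime(sample["DATE"], format="%m/%d/%Y")

result = assign_groups(sample)
print(result)
```

The loop is still sequential, but it touches only NumPy scalars, which is usually far cheaper than df.loc on every row. Because the body is plain array code, it could probably also be compiled with a tool such as Numba, though that is not tried here.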
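From here, you could consider using Dask, Modin, or cuDF (see Modin vs cuDF). But you should probably first work on how your data is organized before processing it. I'm talking about something like this (it's mine, sorry), which should give you an idea of how partitioning the data correctly can speed things up.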
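To illustrate the partitioning idea: since no row needs information from another ID, the data can be cut into pieces that each hold complete IDs, and every piece can then be processed on its own. The sketch below shows only that splitting step in plain pandas. The helper split_by_id and the per-piece function days_since_previous are hypothetical names for this example, and the pieces are processed one after another here rather than by Dask, Modin, or cuDF workers.

```python
import pandas as pd


def split_by_id(df, n_parts):
    """Hypothetical helper: split df into at most n_parts pieces, never cutting an ID in two."""
    # factorize maps each distinct ID to an integer code 0, 1, 2, ...
    codes, _ = pd.factorize(df["ID"])
    part = codes % n_parts
    return [df[part == p] for p in range(n_parts) if (part == p).any()]


def days_since_previous(piece):
    """Example per-piece work: gap in days to the previous row of the same ID."""
    return piece.groupby("ID")["DATE"].diff().dt.days


df = pd.DataFrame({
    "ID": ["a1", "a1", "a2", "a2", "a3"],
    "DATE": pd.to_datetime(["2014-01-01", "2014-01-15", "2015-01-13",
                            "2015-01-20", "2015-02-01"]),
})

parts = split_by_id(df, n_parts=2)

# Each piece could go to a separate worker; here they run one by one.
gaps = pd.concat([days_since_previous(p) for p in parts]).sort_index()
print(gaps)
```

Because every ID lives in exactly one piece, the per-ID results can simply be concatenated and re-sorted by the original index.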