Rolling window count for a date interval in pandas
I have a history of projects with their planned start and end dates:
id planned_start planned_end
1 2017-09-12 2017-09-13
2 2017-09-12 2017-09-14
3 2017-09-12 2017-09-13
4 2017-09-13 2017-09-13
5 2017-09-12 2017-09-12
6 2017-09-12 2017-09-20
7 2017-09-14 2017-09-15
8 2017-09-14 2017-09-20
I want to count, for each project's start date, the number of concurrent projects. Here is my logic:
for project_id in df['id']:
    start_date = df.loc[df['id'] == project_id, 'planned_start'].values[0]
    concurrent_projects = df[(df['planned_start'] <= start_date) & (df['planned_end'] >= start_date)]
    df.loc[df['id'] == project_id, 'concurrent_projects'] = concurrent_projects.shape[0]
which produces:
id planned_start planned_end concurrent_projects
0 1 2017-09-12 2017-09-13 5.0
1 2 2017-09-12 2017-09-14 5.0
2 3 2017-09-12 2017-09-13 5.0
3 4 2017-09-13 2017-09-13 5.0
4 5 2017-09-12 2017-09-12 5.0
5 6 2017-09-12 2017-09-20 5.0
6 7 2017-09-14 2017-09-15 4.0
7 8 2017-09-14 2017-09-20 4.0
However, I know the for loop above is suboptimal time-wise. In reality I have more than 500,000 projects to run this computation over. Can anyone suggest how to speed this up? I suspect there must be a pure pandas, or even NumPy, solution that beats what I have above.
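For reference, the sample frame shown above can be reconstructed with the following snippet (dates parsed with `pd.to_datetime` so the interval comparisons work):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8],
    'planned_start': pd.to_datetime(['2017-09-12', '2017-09-12', '2017-09-12', '2017-09-13',
                                     '2017-09-12', '2017-09-12', '2017-09-14', '2017-09-14']),
    'planned_end': pd.to_datetime(['2017-09-13', '2017-09-14', '2017-09-13', '2017-09-13',
                                   '2017-09-12', '2017-09-20', '2017-09-15', '2017-09-20']),
})
```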
A vectorized way... but it blows up memory. Still working on a better vectorized approach. I have the concept down, just working out the details over dinner.
s = df.planned_start.values
e = df.planned_end.values
# n x n boolean matrices: entry (i, j) marks whether project j's start
# falls inside project i's interval -- O(n^2) memory
s_ = s >= s[:, None]
e_ = s <= e[:, None]
df.assign(concurrent_projects=(e_ & s_).sum(0))
id planned_start planned_end concurrent_projects
0 1 2017-09-12 2017-09-13 5
1 2 2017-09-12 2017-09-14 5
2 3 2017-09-12 2017-09-13 5
3 4 2017-09-13 2017-09-13 5
4 5 2017-09-12 2017-09-12 5
5 6 2017-09-12 2017-09-20 5
6 7 2017-09-14 2017-09-15 4
7 8 2017-09-14 2017-09-20 4
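One way to keep the same broadcast logic within a fixed memory budget is to process the start dates in chunks, so only a chunk × n boolean block is materialized at a time. This is a sketch under that assumption, not the answerer's final method; the chunk size is illustrative and should be tuned to available memory:

```python
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'id': range(1, 9),
    'planned_start': pd.to_datetime(['2017-09-12', '2017-09-12', '2017-09-12', '2017-09-13',
                                     '2017-09-12', '2017-09-12', '2017-09-14', '2017-09-14']),
    'planned_end': pd.to_datetime(['2017-09-13', '2017-09-14', '2017-09-13', '2017-09-13',
                                   '2017-09-12', '2017-09-20', '2017-09-15', '2017-09-20']),
})

s = df['planned_start'].values
e = df['planned_end'].values
chunk = 10_000  # illustrative; caps each boolean block at chunk x n
out = np.empty(len(df), dtype=int)
for lo in range(0, len(df), chunk):
    block = s[lo:lo + chunk]  # start dates being counted in this chunk
    # chunk x n boolean block instead of the full n x n matrix
    out[lo:lo + chunk] = ((s <= block[:, None]) & (e >= block[:, None])).sum(axis=1)
df['concurrent_projects'] = out
```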
Sorry, I don't have time to explain, but I didn't want to leave you hanging.
k = len(df)
d = np.column_stack([df.planned_start.values, df.planned_end.values + 1]).ravel()
i = np.tile([1, -1], k)
a = d.argsort()
f = np.arange(k).repeat(2)
r = np.zeros(k, int)
z = np.zeros(k, int)
m = np.zeros(k, int)
cumsum = 0
for j in range(f.size):
    x = f[a[j]]
    y = i[a[j]]
    r[x] = cumsum
    z[x] = (y + 1) // 2
    r += y * z
    m = np.column_stack([m, r]).max(1)
    cumsum += y
m
array([5, 5, 5, 5, 5, 5, 4, 4])
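The sweep above can also be expressed without an explicit Python loop via `np.searchsorted`: for each start date, count the projects that have started on or before it and subtract those that ended strictly before it. A sketch reusing the question's sample data (sorting makes this O(n log n) with O(n) memory):

```python
import numpy as np
import pandas as pd

# sample data from the question
df = pd.DataFrame({
    'id': range(1, 9),
    'planned_start': pd.to_datetime(['2017-09-12', '2017-09-12', '2017-09-12', '2017-09-13',
                                     '2017-09-12', '2017-09-12', '2017-09-14', '2017-09-14']),
    'planned_end': pd.to_datetime(['2017-09-13', '2017-09-14', '2017-09-13', '2017-09-13',
                                   '2017-09-12', '2017-09-20', '2017-09-15', '2017-09-20']),
})

starts = np.sort(df['planned_start'].values)
ends = np.sort(df['planned_end'].values)
dates = df['planned_start'].values
# projects that have started by each date, minus those that ended before it
concurrent = (np.searchsorted(starts, dates, side='right')
              - np.searchsorted(ends, dates, side='left'))
df['concurrent_projects'] = concurrent
print(df['concurrent_projects'].tolist())  # [5, 5, 5, 5, 5, 5, 4, 4]
```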
Here is my solution using crosstab; it basically does the computation at the matrix level (the input DataFrame is df2):
df = pd.crosstab(df2.planned_end, df2.planned_start, margins=True)
df = pd.concat([df, pd.DataFrame(columns=list(set(df.index) - set(df.columns)))]).fillna(0)
df2['concurrent_projects'] = df2.planned_start.map(df.loc['All', :].cumsum() - df.All.cumsum().shift().fillna(0))
df2
Out[112]:
id planned_start planned_end concurrent_projects
0 1 2017-09-12 2017-09-13 5.0
1 2 2017-09-12 2017-09-14 5.0
2 3 2017-09-12 2017-09-13 5.0
3 4 2017-09-13 2017-09-13 5.0
4 5 2017-09-12 2017-09-12 5.0
5 6 2017-09-12 2017-09-20 5.0
6 7 2017-09-14 2017-09-15 4.0
7 8 2017-09-14 2017-09-20 4.0
Using apply gives roughly a 3x speedup over the loop.
Current approach:
%%timeit
def concurrent_count_using_loop():
    for project_id in df['id']:
        start_date = df.loc[df['id'] == project_id, 'planned_start'].values[0]
        concurrent_projects = df[(df['planned_start'] <= start_date) & (df['planned_end'] >= start_date)]
        df.loc[df['id'] == project_id, 'concurrent_projects'] = concurrent_projects.shape[0]
concurrent_count_using_loop()
# 10 loops, best of 3: 21.4 ms per loop
With apply():
%%timeit
def concurrent_count(project):
    valid_start = df.planned_start <= project["planned_start"]
    valid_end = df.planned_end >= project["planned_start"]
    return (valid_start & valid_end).sum()
df["concurrent_projects"] = df.apply(concurrent_count, axis=1)
# 100 loops, best of 3: 6.94 ms per loop