
Count Number of Rows Between Two Dates BY ID in a Pandas GroupBy Dataframe

I have the following test DataFrame:

import datetime
import random
from datetime import timedelta

import pandas as pd

# create test range of dates
rng = pd.date_range(datetime.date(2015, 1, 1), datetime.date(2015, 7, 31))
rnglist = rng.tolist()
testpts = range(100, 121)

# create test dataframe
# random.randint is inclusive at both ends, so the upper bound must be len(rng) - 1
d = {'jid': list(testpts),
     'cid': [random.randint(1, 2) for _ in testpts],
     'stdt': [rnglist[random.randint(0, len(rng) - 1)] for _ in testpts]}
df = pd.DataFrame(d)
# a single random offset is drawn here and applied to every row
df['enddt'] = df['stdt'] + timedelta(days=random.randint(2, 32))

This gives a DataFrame like the one below, with a company id column 'cid', a unique id column 'jid', a start date 'stdt', and an end date 'enddt'.

   cid  jid       stdt      enddt
0    1  100 2015-07-06 2015-07-13
1    1  101 2015-07-15 2015-07-22
2    2  102 2015-07-12 2015-07-19
3    2  103 2015-07-07 2015-07-14
4    2  104 2015-07-14 2015-07-21
5    1  105 2015-07-11 2015-07-18
6    1  106 2015-07-12 2015-07-19
7    2  107 2015-07-01 2015-07-08
8    2  108 2015-07-10 2015-07-17
9    2  109 2015-07-09 2015-07-16

What I need to do is the following: for each cid, and for each date (newdate) between min(stdt) and max(enddt), count the number of jid for which newdate falls between stdt and enddt.

The resulting data set should be a DataFrame that has, for each cid, a column of dates (newdate) covering the range from the min(stdt) to the max(enddt) specific to that cid, and a count (cnt) of the number of jid whose stdt-to-enddt interval contains newdate. The resulting DataFrame should look like this (this is just for cid 1, using the data above):

cid newdate     cnt
1   2015-07-06  1
1   2015-07-07  1
1   2015-07-08  1
1   2015-07-09  1
1   2015-07-10  1
1   2015-07-11  2
1   2015-07-12  3
1   2015-07-13  3
1   2015-07-14  2
1   2015-07-15  3
1   2015-07-16  3
1   2015-07-17  3
1   2015-07-18  3
1   2015-07-19  2
1   2015-07-20  1
1   2015-07-21  1
1   2015-07-22  1

I believe there should be a way to use pandas groupby (grouping by cid) and some form of lambda(?) to create this new DataFrame pythonically.

I currently run a loop over each cid (I slice that cid's rows out of the master df). Inside the loop I determine the relevant date range (min stdt and max enddt for that cid's frame); then, for each newdate in that range (mindate to maxdate), I count the number of jid where newdate is between the stdt and enddt of that jid. Then I append each resulting dataset to a new DataFrame, which looks like the above.

But this is very expensive in terms of resources and time. Doing this on millions of jid for thousands of cid literally takes a full day. I am hoping there is a simple(r) pandas solution here.
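For reference, the loop described above might look roughly like the sketch below. This is a reconstruction from the description, not the actual code in use; the function name `count_by_loop` is hypothetical, and the sample rows are copied from the ten-row table shown earlier (every interval there is 7 days long).

```python
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "cid": [1, 1, 2, 2, 2, 1, 1, 2, 2, 2],
    "jid": list(range(100, 110)),
    "stdt": pd.to_datetime([
        "2015-07-06", "2015-07-15", "2015-07-12", "2015-07-07", "2015-07-14",
        "2015-07-11", "2015-07-12", "2015-07-01", "2015-07-10", "2015-07-09",
    ]),
})
df["enddt"] = df["stdt"] + pd.Timedelta(days=7)


def count_by_loop(frame):
    """Hypothetical baseline: scan every day of every cid's own date range."""
    parts = []
    for cid, grp in frame.groupby("cid"):
        days = pd.date_range(grp["stdt"].min(), grp["enddt"].max())
        # For each day, count the jid rows whose interval contains it (inclusive).
        cnt = [int(((grp["stdt"] <= d) & (grp["enddt"] >= d)).sum()) for d in days]
        parts.append(pd.DataFrame({"cid": cid, "newdate": days, "cnt": cnt}))
    return pd.concat(parts, ignore_index=True)


baseline = count_by_loop(df)
print(baseline[baseline["cid"] == 1])
```

The work grows with (number of days) x (number of jid) per cid, which is consistent with the run times mentioned above.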

My usual approach for these problems is to pivot and think in terms of events changing an accumulator. Every new "stdt" we see adds +1 to the count; every "enddt" we see adds -1. (It adds -1 the next day, at least if I'm interpreting "between" the way you are. Some days I think we should ban the use of the word as too ambiguous.)

In other words, if we turn your frame into something like

>>> df.head()
    cid  jid  change       date
0     1  100       1 2015-01-06
1     1  101       1 2015-01-07
21    1  100      -1 2015-01-16
22    1  101      -1 2015-01-17
17    1  117       1 2015-03-01

then what we want is simply the cumulative sum of change (after suitable regrouping). For example, something like

# the interval is inclusive, so the -1 event fires the day after enddt
df["enddt"] += timedelta(days=1)
df = pd.melt(df, id_vars=["cid", "jid"], var_name="change", value_name="date")
# turn the column labels into +1 / -1 events
df["change"] = df["change"].map({"stdt": 1, "enddt": -1})
# DataFrame.sort was removed from pandas; sort_values is the current spelling
df = df.sort_values(["cid", "date"])

# net change per cid per day, then a running total within each cid
df = df.groupby(["cid", "date"], as_index=False)["change"].sum()
df["count"] = df.groupby("cid")["change"].cumsum()

new_time = pd.date_range(df.date.min(), df.date.max())

df_parts = []
for cid, group in df.groupby("cid"):
    full_count = group[["date", "count"]].set_index("date")
    full_count = full_count.reindex(new_time)
    full_count = full_count.ffill().fillna(0)
    full_count["cid"] = cid
    df_parts.append(full_count)

df_new = pd.concat(df_parts)

which gives me something like

>>> df_new.head(15)
            count  cid
2015-01-03      0    1
2015-01-04      0    1
2015-01-05      0    1
2015-01-06      1    1
2015-01-07      2    1
2015-01-08      2    1
2015-01-09      2    1
2015-01-10      2    1
2015-01-11      2    1
2015-01-12      2    1
2015-01-13      2    1
2015-01-14      2    1
2015-01-15      2    1
2015-01-16      1    1
2015-01-17      0    1

There may be off-by-one differences with regard to your expectations; you may have different ideas about how you should handle multiple overlapping jids in the same time window (here they would count as 2); but the basic idea of working with the events should prove useful even if you have to tweak the details.

Here is a solution I came up with (this loops through the Cartesian product of the unique cids and the date range to get your counts):
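As a variation on the same event idea, here is a minimal sketch that restricts each cid to its own min(stdt)..max(enddt) range, which is what the question asks for. The function name `count_active` and the output column names are assumptions chosen to match the question's desired layout, and the sample rows are copied from the question's table.

```python
import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "cid": [1, 1, 2, 2, 2, 1, 1, 2, 2, 2],
    "jid": list(range(100, 110)),
    "stdt": pd.to_datetime([
        "2015-07-06", "2015-07-15", "2015-07-12", "2015-07-07", "2015-07-14",
        "2015-07-11", "2015-07-12", "2015-07-01", "2015-07-10", "2015-07-09",
    ]),
})
df["enddt"] = df["stdt"] + pd.Timedelta(days=7)


def count_active(frame):
    """Hypothetical helper: +1 on stdt, -1 the day after enddt, then cumsum."""
    starts = frame[["cid", "stdt"]].rename(columns={"stdt": "date"}).assign(change=1)
    ends = frame[["cid", "enddt"]].rename(columns={"enddt": "date"}).assign(change=-1)
    # Inclusive interval: the job stops counting the day after enddt.
    ends["date"] = ends["date"] + pd.Timedelta(days=1)
    events = pd.concat([starts, ends], ignore_index=True)
    net = events.groupby(["cid", "date"])["change"].sum()

    parts = []
    for cid, s in net.groupby(level="cid"):
        s = s.droplevel("cid")
        # The last event is the final -1, one day past the cid's max(enddt).
        days = pd.date_range(s.index.min(), s.index.max() - pd.Timedelta(days=1))
        cnt = s.reindex(days, fill_value=0).cumsum()
        parts.append(pd.DataFrame({"cid": cid, "newdate": days, "cnt": cnt.to_numpy()}))
    return pd.concat(parts, ignore_index=True)


result = count_active(df)
print(result[result["cid"] == 1])
```

Because the days without events are filled with a change of 0 before the cumulative sum, no forward fill is needed, and days inside a cid's range on which nothing is active are kept with a count of 0.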

from itertools import product
df_new_date=pd.DataFrame(list(product(df.cid.unique(),pd.date_range(df.stdt.min(), df.enddt.max()))),columns=['cid','newdate'])
df_new_date['cnt']=df_new_date.apply(lambda row:df[(df['cid']==row['cid'])&(df['stdt']<=row['newdate'])&(df['enddt']>=row['newdate'])]['jid'].count(),axis=1)

>>> df_new_date.head(20) 
    cid    newdate  cnt
0     1 2015-07-01    0
1     1 2015-07-02    0
2     1 2015-07-03    0
3     1 2015-07-04    0
4     1 2015-07-05    0
5     1 2015-07-06    1
6     1 2015-07-07    1
7     1 2015-07-08    1
8     1 2015-07-09    1
9     1 2015-07-10    1
10    1 2015-07-11    2
11    1 2015-07-12    3
12    1 2015-07-13    3
13    1 2015-07-14    2
14    1 2015-07-15    3
15    1 2015-07-16    3
16    1 2015-07-17    3
17    1 2015-07-18    3
18    1 2015-07-19    2
19    1 2015-07-20    1

You could then drop the zeros if you don't want them. I don't think this will be much better than your original solution, however.

I would suggest the following improvement on the loop in @DSM's solution:
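Dropping the zeros could be sketched as follows, using the sample rows from the question's table. The names `grid` and `nonzero` are illustrative only. Note that this filter would also remove any day inside a cid's own range on which no jid happens to be active, so it only approximates a per-cid min(stdt)..max(enddt) range.

```python
from itertools import product

import pandas as pd

# Sample rows copied from the table in the question.
df = pd.DataFrame({
    "cid": [1, 1, 2, 2, 2, 1, 1, 2, 2, 2],
    "jid": list(range(100, 110)),
    "stdt": pd.to_datetime([
        "2015-07-06", "2015-07-15", "2015-07-12", "2015-07-07", "2015-07-14",
        "2015-07-11", "2015-07-12", "2015-07-01", "2015-07-10", "2015-07-09",
    ]),
})
df["enddt"] = df["stdt"] + pd.Timedelta(days=7)

# Every (cid, day) pair over the global date range.
all_days = pd.date_range(df["stdt"].min(), df["enddt"].max())
grid = pd.DataFrame(list(product(df["cid"].unique(), all_days)),
                    columns=["cid", "newdate"])

# Count the jid rows of the same cid whose interval contains the day.
grid["cnt"] = grid.apply(
    lambda row: int(((df["cid"] == row["cid"])
                     & (df["stdt"] <= row["newdate"])
                     & (df["enddt"] >= row["newdate"])).sum()),
    axis=1,
)

# Keep only the days on which at least one jid is active.
nonzero = grid[grid["cnt"] > 0].reset_index(drop=True)
print(nonzero.head())
```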

df_parts=[]
for cid in df.cid.unique():
    full_count=df[(df.cid==cid)][['cid','date','count']].set_index("date").asfreq("D", method='ffill')[['cid','count']].reset_index()
    df_parts.append(full_count[full_count['count']!=0])

df_new = pd.concat(df_parts)

>>> df_new
         date  cid  count
0  2015-07-06    1      1
1  2015-07-07    1      1
2  2015-07-08    1      1
3  2015-07-09    1      1
4  2015-07-10    1      1
5  2015-07-11    1      2
6  2015-07-12    1      3
7  2015-07-13    1      3
8  2015-07-14    1      2
9  2015-07-15    1      3
10 2015-07-16    1      3
11 2015-07-17    1      3
12 2015-07-18    1      3
13 2015-07-19    1      2
14 2015-07-20    1      1
15 2015-07-21    1      1
16 2015-07-22    1      1
0  2015-07-01    2      1
1  2015-07-02    2      1
2  2015-07-03    2      1
3  2015-07-04    2      1
4  2015-07-05    2      1
5  2015-07-06    2      1
6  2015-07-07    2      2
7  2015-07-08    2      2
8  2015-07-09    2      2
9  2015-07-10    2      3
10 2015-07-11    2      3
11 2015-07-12    2      4
12 2015-07-13    2      4
13 2015-07-14    2      5
14 2015-07-15    2      4
15 2015-07-16    2      4
16 2015-07-17    2      3
17 2015-07-18    2      2
18 2015-07-19    2      2
19 2015-07-20    2      1
20 2015-07-21    2      1

The only real improvement over what @DSM provided is that this avoids creating a groupby object for the loop, and it also gives you the full range from min stdt to max enddt for each cid, with no zero values.
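To see what asfreq with a forward fill does in that loop, here is a small illustration on made-up running counts (the three dates and values below are invented for the example, not taken from the question's data):

```python
import pandas as pd

# Made-up running counts, recorded only on days where the count changes.
sparse = pd.DataFrame(
    {"date": pd.to_datetime(["2015-07-06", "2015-07-11", "2015-07-14"]),
     "count": [1, 2, 0]}
).set_index("date")

# asfreq("D") inserts the missing calendar days; ffill carries the last count forward.
dense = sparse.asfreq("D", method="ffill")

# Dropping zero rows trims the trailing day on which nothing is active.
trimmed = dense[dense["count"] != 0].reset_index()
print(trimmed)
```

The same filter would also drop any zero-count day in the middle of a cid's range, so it matches the per-cid min/max range only when the intervals leave no gaps.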
