How to fill missing dates per group in a pandas DataFrame
My initial dataframe is:
import pandas as pd

df = pd.DataFrame({"a": ["2020-01-01", "2020-01-06", "2020-01-04", "2020-01-07"],
                   "b": ["a", "a", "b", "b"],
                   "c": [1, 2, 3, 4]})
print(df)
            a  b  c
0  2020-01-01  a  1
1  2020-01-06  a  2
2  2020-01-04  b  3
3  2020-01-07  b  4
I would like my dataset to look like this:
            a  b    c
0  2020-01-01  a    1
1  2020-01-02  a  NaN
2  2020-01-03  a  NaN
3  2020-01-04  a  NaN
4  2020-01-05  a  NaN
5  2020-01-06  a    2
6  2020-01-04  b    3
7  2020-01-05  b  NaN
8  2020-01-06  b  NaN
9  2020-01-07  b    4
I tried:
d.set_index([d.a, d.b], inplace=True)
d.asfreq("D")
d.set_index([d.a, d.b], inplace=True)
d.resample("D")
but I get:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'MultiIndex'
The grouping column in my real DataFrame (column "b" in this example) has many unique values.
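The TypeError happens because asfreq and resample only operate on a plain DatetimeIndex, not a MultiIndex. A minimal sketch (single group, so no MultiIndex is needed) where asfreq behaves as expected:

```python
import pandas as pd

# asfreq requires a plain DatetimeIndex; with one group it works directly.
s = pd.DataFrame({"a": pd.to_datetime(["2020-01-01", "2020-01-06"]),
                  "c": [1, 2]})
filled = s.set_index("a").asfreq("D")  # one row per calendar day
print(filled)
```

The per-group variants in the answers below all reduce to applying this single-index step once per group.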
df = pd.DataFrame({"a": ["2020-01-01", "2020-01-06", "2020-01-04", "2020-01-07"],
                   "b": ["a", "a", "b", "b"],
                   "c": [1, 2, 3, 4]})
# make "a" a real datetime column
df['a'] = pd.to_datetime(df['a'])
# create a groupby object
g = df.groupby('b')
# list comprehension with reindex and date_range, then concat the list of frames
df2 = pd.concat([grp.set_index('a')
                    .reindex(pd.date_range(grp['a'].min(), grp['a'].max(), freq='D'))
                 for _, grp in g])
# forward-fill column "b"
df2['b'] = df2['b'].ffill()
            b    c
2020-01-01  a  1.0
2020-01-02  a  NaN
2020-01-03  a  NaN
2020-01-04  a  NaN
2020-01-05  a  NaN
2020-01-06  a  2.0
2020-01-04  b  3.0
2020-01-05  b  NaN
2020-01-06  b  NaN
2020-01-07  b  4.0
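A quick way to sanity-check this recipe: each group should end up with one row per day between its own min and max date. A self-contained sketch repeating the steps above and verifying the per-group row counts:

```python
import pandas as pd

df = pd.DataFrame({"a": ["2020-01-01", "2020-01-06", "2020-01-04", "2020-01-07"],
                   "b": ["a", "a", "b", "b"],
                   "c": [1, 2, 3, 4]})
df["a"] = pd.to_datetime(df["a"])

# reindex each group onto its own daily date_range, then concat
df2 = pd.concat(
    [g.set_index("a").reindex(pd.date_range(g["a"].min(), g["a"].max(), freq="D"))
     for _, g in df.groupby("b")]
)
df2["b"] = df2["b"].ffill()

# group "a" spans 6 days (Jan 1-6), group "b" spans 4 days (Jan 4-7)
print(df2["b"].value_counts())
```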
Another approach, with groupby and asfreq:
(df.set_index('a')
   .groupby('b')
   .apply(lambda x: x.drop('b', axis=1).asfreq('D'))
   .reset_index()
)
Output:
   b          a    c
0  a 2020-01-01  1.0
1  a 2020-01-02  NaN
2  a 2020-01-03  NaN
3  a 2020-01-04  NaN
4  a 2020-01-05  NaN
5  a 2020-01-06  2.0
6  b 2020-01-04  3.0
7  b 2020-01-05  NaN
8  b 2020-01-06  NaN
9  b 2020-01-07  4.0
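A related sketch (not from the answer above, just a plain-pandas variant) uses GroupBy.resample, which handles the per-group DatetimeIndex and so avoids dropping column "b" inside the lambda:

```python
import pandas as pd

df = pd.DataFrame({"a": pd.to_datetime(["2020-01-01", "2020-01-06",
                                        "2020-01-04", "2020-01-07"]),
                   "b": ["a", "a", "b", "b"],
                   "c": [1, 2, 3, 4]})

# GroupBy.resample resamples each group's DatetimeIndex separately
out = (df.set_index("a")
         .groupby("b")[["c"]]   # select "c" so "b" only appears in the index
         .resample("D")
         .asfreq()
         .reset_index())
print(out)
```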
We can use the complete function from pyjanitor, which provides a convenient abstraction to generate the missing rows:
# pip install pyjanitor
import pandas as pd
import janitor

df['a'] = pd.to_datetime(df['a'])
# create a mapping for the new dates
dates = {"a": lambda a: pd.date_range(a.min(), a.max(), freq='1D')}
# create the new dataframe, exposing the missing rows per group:
df.complete(dates, by='b', sort=True)
            a  b    c
0  2020-01-01  a  1.0
1  2020-01-02  a  NaN
2  2020-01-03  a  NaN
3  2020-01-04  a  NaN
4  2020-01-05  a  NaN
5  2020-01-06  a  2.0
6  2020-01-04  b  3.0
7  2020-01-05  b  NaN
8  2020-01-06  b  NaN
9  2020-01-07  b  4.0
The by option is mainly for convenience; getting more performance out of it requires a bit more work:
# build the mapping for the entire dataframe, ignoring the groups
dates = {"a": pd.date_range(df.a.min(), df.a.max(), freq='1D')}
# create a groupby object
grouped = df.groupby('b').a
# create temporary columns for the min and max dates
(df.assign(date_min=grouped.transform('min'),
           date_max=grouped.transform('max'))
   # expose the missing rows based on the combination of dates
   # and the tuple of b, date_min and date_max
   .complete(dates, ('b', 'date_min', 'date_max'))
   # filter rows, keeping only the dates relevant to each group
   .loc[lambda df: df.a.between(df.date_min, df.date_max), df.columns]
   .sort_values(['b', 'a'], ignore_index=True)
)
            a  b    c
0  2020-01-01  a  1.0
1  2020-01-02  a  NaN
2  2020-01-03  a  NaN
3  2020-01-04  a  NaN
4  2020-01-05  a  NaN
5  2020-01-06  a  2.0
6  2020-01-04  b  3.0
7  2020-01-05  b  NaN
8  2020-01-06  b  NaN
9  2020-01-07  b  4.0
For large dataframes, the performance should be significantly better. A similar test example is provided here.
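The same expand-then-filter idea can be sketched in plain pandas, without pyjanitor: build the full cross product of groups and the global date range, then keep only the dates inside each group's own window. This is an assumption-labeled illustration of the technique, not the library's implementation:

```python
import pandas as pd

df = pd.DataFrame({"a": pd.to_datetime(["2020-01-01", "2020-01-06",
                                        "2020-01-04", "2020-01-07"]),
                   "b": ["a", "a", "b", "b"],
                   "c": [1, 2, 3, 4]})

# cross-join every group with the full global date range...
full = pd.MultiIndex.from_product(
    [df["b"].unique(), pd.date_range(df["a"].min(), df["a"].max(), freq="D")],
    names=["b", "a"])
expanded = df.set_index(["b", "a"]).reindex(full).reset_index()

# ...then keep only dates inside each group's own [min, max] window
bounds = df.groupby("b")["a"].agg(["min", "max"]).reset_index()
expanded = expanded.merge(bounds, on="b")
result = (expanded.loc[expanded["a"].between(expanded["min"], expanded["max"]),
                       ["a", "b", "c"]]
          .sort_values(["b", "a"], ignore_index=True))
print(result)
```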