[英]PYTHON/PANDAS - Reindexing on multiple indexes
I have a dataframe similar to what follows:我有一个类似于以下内容的数据框:
test = {"id": ["A", "A", "A", "B", "B", "B"],
"date": ["09-02-2013", "09-03-2013", "09-05-2013", "09-15-2013", "09-17-2013", "09-18-2013"],
"country": ["Poland", "Poland", "France", "Scotland", "Scotland", "Canada"]}
and I want a table which returns this :我想要一个返回这个的表:
id ![]() |
date![]() |
country![]() |
---|---|---|
A![]() |
09-02-2013 ![]() |
Poland![]() |
A![]() |
09-03-2013 ![]() |
Poland![]() |
A![]() |
09-04-2013 ![]() |
Poland![]() |
A![]() |
09-05-2013 ![]() |
France![]() |
B![]() |
09-15-2013 ![]() |
Scotland![]() |
B![]() |
09-16-2013 ![]() |
Scotland![]() |
B![]() |
09-17-2013 ![]() |
Scotland![]() |
B![]() |
09-18-2013 ![]() |
Canada![]() |
ie a table that fills in any date that I am missing but will only do it to the min/max of each id即填写我丢失的任何日期但只会填写每个ID的最小/最大值的表格
I have looked around stack overflow but usually this problem just has one index or the person wants to drop an index anyway This is what I have got so far:我已经查看了堆栈溢出,但通常这个问题只有一个索引,或者这个人无论如何都想删除一个索引这是我到目前为止所得到的:
test_df = pd.DataFrame(test)
# get min date per id
dates = test_df.groupby("id")["date"].min().to_frame(name="min")
# get max date
dates["max"] = test_df.groupby("id")["date"].max().to_frame(name="max")
midx = pd.MultiIndex.from_frame(dates.apply(lambda x: pd.date_range(x["min"], x["max"], freq="D"), axis=1).explode().reset_index(name="date")[["date", "id"]])
test_df = test_df.set_index(["date", "id"])
test_df = test_df.reindex(midx).fillna(method="ffill")
test_df
Which gets me really close but not quite there, with the dates all there but no country:这让我非常接近但并不完全在那里,日期都在那里但没有国家:
id ![]() |
date![]() |
country![]() |
---|---|---|
A![]() |
09-02-2013 ![]() |
NaN![]() |
A![]() |
09-03-2013 ![]() |
NaN![]() |
A![]() |
09-04-2013 ![]() |
NaN![]() |
A![]() |
09-05-2013 ![]() |
NaN![]() |
B![]() |
09-15-2013 ![]() |
NaN![]() |
B![]() |
09-16-2013 ![]() |
NaN![]() |
B![]() |
09-17-2013 ![]() |
NaN![]() |
B![]() |
09-18-2013 ![]() |
NaN![]() |
Any ideas on how to fix it?关于如何解决它的任何想法?
IIUC, you could generate a date_range
per group, explode
, then merge
and ffill
the values per group: IIUC,您可以为每组生成一个
date_range
, explode
,然后merge
并ffill
的值:
out = (test_df
.merge(pd
.to_datetime(test_df['date'], dayfirst=False)
.groupby(test_df['id'])
.apply(lambda g: pd.date_range(g.min(), g.max(), freq='D'))
.explode().dt.strftime('%m-%d-%Y')
.reset_index(name='date'),
how='right'
)
.assign(country=lambda d: d.groupby('id')['country'].ffill())
)
output:输出:
id date country
0 A 09-02-2013 Poland
1 A 09-03-2013 Poland
2 A 09-04-2013 Poland
3 A 09-05-2013 France
4 B 09-15-2013 Scotland
5 B 09-16-2013 Scotland
6 B 09-17-2013 Scotland
7 B 09-18-2013 Canada
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.