Insert rows to fill years gap in Pandas data frame
I have the following DataFrame:
import pandas as pd
data = {'id': ['A', 'A','B', 'C'],
'year': [2002,2002, 2003, 2004],
'city':['London', 'Rome','Paris', 'Berlin'],
'appearence': [1,1,1,1]}
df = pd.DataFrame(data)
I want to fill the gaps in the year column, starting from 2000 and running up to the year in which appearence equals 1. The appearence column is always equal to 1 in the input DataFrame. Please note that each id can be in two different cities in the same year.
The desired output:
import pandas as pd
data = {'id': ['A', 'A', 'A', 'A', 'A', 'A','B','B','B','B','C','C','C','C','C'],
'year': [2000, 2001, 2002, 2000, 2001, 2002,2000, 2001, 2002, 2003,2000,2001,2002,2003, 2004],
'city':['NaN', 'NaN','London','NaN', 'NaN','Rome', 'NaN', 'NaN','NaN','Paris', 'NaN', 'NaN','NaN','NaN','Berlin'],
'appearence': [0,0,1,0,0,1,0,0,0,1,0,0,0,0,1]}
df = pd.DataFrame(data)
Solution that prepends the years starting from 2000. It works if appearence=1 in the input DataFrame and each id has only one row per year, because reindex needs unique year labels within a group:
f = lambda x: x.set_index('year').reindex(range(2000, x['year'].max() + 1))
df = df.groupby('id').apply(f).drop('id', axis=1).fillna({'appearence': 0}).reset_index()
print (df)
id year city appearence
0 A 2000 NaN 0.0
1 A 2001 NaN 0.0
2 A 2002 London 1.0
3 B 2000 NaN 0.0
4 B 2001 NaN 0.0
5 B 2002 NaN 0.0
6 B 2003 Paris 1.0
7 C 2000 NaN 0.0
8 C 2001 NaN 0.0
9 C 2002 NaN 0.0
10 C 2003 NaN 0.0
11 C 2004 Berlin 1.0
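A note on the limitation: the reindex call above expects unique year labels within each group, so it does not handle an id with two rows for the same year (like A in the question). A minimal sketch of that failure, using a hypothetical two-row frame `g` that stands in for group A:

```python
import pandas as pd

# Hypothetical stand-in for group "A": two rows share the year 2002.
g = pd.DataFrame({"year": [2002, 2002], "city": ["London", "Rome"]}).set_index("year")

try:
    g.reindex(range(2000, 2003))
    error = None
except ValueError as exc:
    # pandas refuses to reindex an axis that carries duplicate labels
    error = exc

print(type(error).__name__)
```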
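EDIT: To handle an id that appears more than once in the same year, group by the original index together with id, so that each input row is reindexed separately: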
f = lambda x: x.set_index('year').reindex(range(2000, x['year'].max() + 1))
df = df.groupby([df.index, 'id']).apply(f).drop('id', axis=1).fillna({'appearence': 0}).droplevel(0).reset_index()
print (df)
id year city appearence
0 A 2000 NaN 0.0
1 A 2001 NaN 0.0
2 A 2002 London 1.0
3 A 2000 NaN 0.0
4 A 2001 NaN 0.0
5 A 2002 Rome 1.0
6 B 2000 NaN 0.0
7 B 2001 NaN 0.0
8 B 2002 NaN 0.0
9 B 2003 Paris 1.0
10 C 2000 NaN 0.0
11 C 2001 NaN 0.0
12 C 2002 NaN 0.0
13 C 2003 NaN 0.0
14 C 2004 Berlin 1.0
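The same result can probably be reached without groupby.apply, by repeating each input row once per year it has to cover. The following is only a sketch under assumptions: the names `START`, `n`, `rep`, `offset`, `last_year` and `out` are illustrative, and the input index is assumed to be unique (the default RangeIndex here).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": ["A", "A", "B", "C"],
    "year": [2002, 2002, 2003, 2004],
    "city": ["London", "Rome", "Paris", "Berlin"],
    "appearence": [1, 1, 1, 1],
})

START = 2000  # first year to fill from, taken from the question

# Number of rows each input row expands into: START..year inclusive.
n = df["year"] - START + 1

# Repeat every input row n times; the repeated index labels mark the blocks.
rep = df.loc[df.index.repeat(n)]

# Position inside each block: 0, 1, ..., n-1.
offset = rep.groupby(level=0).cumcount().to_numpy()

out = rep.reset_index(drop=True)
last_year = out["year"].copy()   # the observed year of each block
out["year"] = START + offset     # 2000, 2001, ..., observed year

# Every row before the observed year is a filler row.
filled = out["year"] < last_year
out.loc[filled, "city"] = np.nan
out.loc[filled, "appearence"] = 0

print(out)
```

Because no missing values are ever introduced into appearence, the column keeps its integer dtype here, unlike the reindex-based version.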
One option is to use complete from pyjanitor to abstract the reshaping (it explicitly exposes the missing rows):
# pip install pyjanitor
import pandas as pd
import janitor
# create dictionary for new dates
dates = {"year": lambda df: range(2000, df.max() + 1)}
# execute complete, and fill the nulls with 0
(df.complete(dates, by="id", sort=True)
.fillna({"appearence": 0}, downcast="infer")
)
id year city appearence
0 A 2000 NaN 0
1 A 2001 NaN 0
2 A 2002 London 1
3 B 2000 NaN 0
4 B 2001 NaN 0
5 B 2002 NaN 0
6 B 2003 Paris 1
7 C 2000 NaN 0
8 C 2001 NaN 0
9 C 2002 NaN 0
10 C 2003 NaN 0
11 C 2004 Berlin 1
A plain-Python alternative that builds the columns in a loop:
import pandas as pd
start = 2000
data = {'id': ['A', 'B', 'C'],
'year': [2002, 2003, 2004],
'city':['London', 'Paris', 'Berlin'],
'appearence': [1,1,1]}
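row1 = []  # id
row2 = []  # year
row3 = []  # city
row4 = []  # appearence
counter = 0
for i in data['year']:
    # one placeholder row per year from start up to and including year i
    for j in range(start, i + 1):
        row1.append(data['id'][counter])
        row2.append(j)
        row3.append("NaN")
        row4.append(0)
    # the last row of the block is the real observation: restore its values
    row4.pop()
    row4.append(data['appearence'][counter])
    row3.pop()
    row3.append(data['city'][counter])
    counter = counter + 1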
data = {'id': row1,
'year': row2,
'city':row3,
'appearence': row4}
df = pd.DataFrame(data)
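The loop above stores the string "NaN" and walks four parallel lists. A hedged variant of the same idea, assuming real missing values (np.nan) are wanted and using the question's data with the repeated id A, could collect one dict per output row instead. The names `START`, `records` and `out` are illustrative only.

```python
import numpy as np
import pandas as pd

START = 2000  # first year to fill from, as in the question

df = pd.DataFrame({
    "id": ["A", "A", "B", "C"],
    "year": [2002, 2002, 2003, 2004],
    "city": ["London", "Rome", "Paris", "Berlin"],
    "appearence": [1, 1, 1, 1],
})

records = []
for row in df.itertuples(index=False):
    # filler rows for every year before the observed one
    for year in range(START, row.year):
        records.append({"id": row.id, "year": year, "city": np.nan, "appearence": 0})
    # the observed row itself, unchanged
    records.append(row._asdict())

out = pd.DataFrame(records)
print(out)
```

Each input row is expanded on its own, so two rows for the same id and year simply produce two blocks.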