I have the DataFrame below:
import pandas as pd
df = pd.DataFrame(
{"col1": [2000, 2000, 2000, '', 2001, 2001, '', '', 2002],
"col2": ["b1", "c1", "d1", '' , "c1", "d1", '', '', "d1"],
"col3": [10, 20, 30, '', 20, 40, '', '', 60]
}
)
df
col1 col2 col3
0 2000 b1 10
1 2000 c1 20
2 2000 d1 30
3
4 2001 c1 20
5 2001 d1 40
6
7
8 2002 d1 60
I need 3 rows for each date from 2000 to 2002, one each for b1, c1 and d1. When a row is missing (like rows 3, 6 and 7), I want to fill it in so that it has the date, the missing b1, c1 or d1 label, and 0 in col3, just like in df2 below:
df2 = pd.DataFrame(
{"col1": [2000, 2000, 2000, 2001, 2001, 2001, 2002, 2002, 2002],
"col2": ["b1", "c1", "d1", "b1" , "c1", "d1", "b1", "c1", "d1"],
"col3": [10, 20, 30, 0, 20, 40, 0, 0, 60]
}
)
df2
col1 col2 col3
0 2000 b1 10
1 2000 c1 20
2 2000 d1 30
3 2001 b1 0
4 2001 c1 20
5 2001 d1 40
6 2002 b1 0
7 2002 c1 0
8 2002 d1 60
How can I do this in pandas? (My real DataFrame is large and has many more dates than these 3, but this example should give me the idea.)
You can take the Cartesian product of the years present in col1 with the expected col2 values to build every combination that should exist.
Then left-merge the original data onto those combinations and fill the missing col3 values with fillna:
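import numpy as np
# Turn the blank placeholder rows into NaN and drop them
df = df.replace('',np.nan).dropna(subset=['col1'])
rows = ['b1','c1','d1']
# Every (year, label) pair that should exist
possibilities = pd.MultiIndex.from_product((df['col1'].unique(),rows))
# Left join keeps every pair; pairs missing from df get NaN in col3, then 0
out = (pd.DataFrame(possibilities.tolist(),columns=['col1','col2'])
.merge(df,how='left').fillna({"col3":0}))
# Explicit casts instead of fillna(downcast='infer'), which newer pandas deprecates
out = out.astype({'col1': int, 'col3': int})
Or:
out = (possibilities.to_frame(name=['col1','col2']).merge(df,how='left')
.fillna({"col3":0}))
out = out.astype({'col1': int, 'col3': int})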
print(out)
col1 col2 col3
0 2000 b1 10
1 2000 c1 20
2 2000 d1 30
3 2001 b1 0
4 2001 c1 20
5 2001 d1 40
6 2002 b1 0
7 2002 c1 0
8 2002 d1 60
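As a variation on the merge approach above: df['col1'].unique() only sees years that have at least one row, so a year missing entirely would not be filled. The sketch below is self-contained and assumes, beyond what the question states, that the years are consecutive integers, so the grid can be built from a range instead. The names clean, years and grid are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [2000, 2000, 2000, '', 2001, 2001, '', '', 2002],
     "col2": ["b1", "c1", "d1", '', "c1", "d1", '', '', "d1"],
     "col3": [10, 20, 30, '', 20, 40, '', '', 60]}
)

# Keep only the rows that carry data (col1 is not the empty string)
clean = df[df["col1"] != ''].astype({"col1": int, "col3": int})

# Assumption: years are consecutive integers, so a range also covers
# years that have no rows at all
years = range(clean["col1"].min(), clean["col1"].max() + 1)
rows = ["b1", "c1", "d1"]

# Every (year, label) pair that should exist, as a two-column frame
grid = pd.MultiIndex.from_product(
    [years, rows], names=["col1", "col2"]
).to_frame(index=False)

# Left join keeps the full grid; pairs absent from the data get NaN, then 0
out = grid.merge(clean, how="left", on=["col1", "col2"])
out["col3"] = out["col3"].fillna(0).astype(int)
print(out)
```

If the dates are not consecutive integers, passing an explicit list of expected dates in place of the range should work the same way.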
Use DataFrame.reindex to add 0 for the combinations that do not exist:
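import numpy as np
# Start from the original df (not df2, which is the expected result)
df = df.replace('',np.nan).dropna(subset=['col1'])
rows = ['b1','c1','d1']
mux = pd.MultiIndex.from_product((df['col1'].unique(),rows), names=['col1','col2'])
# Reindex on the full set of pairs; new rows get 0 in col3
df = df.set_index(['col1','col2']).reindex(mux, fill_value=0).reset_index()
# Cast back to int in case replace/dropna left float or object columns
df = df.astype({'col1': int, 'col3': int})
print(df)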
col1 col2 col3
0 2000 b1 10
1 2000 c1 20
2 2000 d1 30
3 2001 b1 0
4 2001 c1 20
5 2001 d1 40
6 2002 b1 0
7 2002 c1 0
8 2002 d1 60
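A self-contained sketch of the reindex approach, with one extra guard: reindex cannot align against duplicate labels, so it only applies when each (col1, col2) pair occurs at most once. The uniqueness check and the names clean2, indexed and filled are illustrative additions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [2000, 2000, 2000, '', 2001, 2001, '', '', 2002],
     "col2": ["b1", "c1", "d1", '', "c1", "d1", '', '', "d1"],
     "col3": [10, 20, 30, '', 20, 40, '', '', 60]}
)

# Drop the blank placeholder rows and restore integer dtypes
clean2 = df[df["col1"] != ''].astype({"col1": int, "col3": int})

rows = ["b1", "c1", "d1"]
mux = pd.MultiIndex.from_product(
    [clean2["col1"].unique(), rows], names=["col1", "col2"]
)

indexed = clean2.set_index(["col1", "col2"])

# reindex raises on duplicate labels, so fail early with a clear message
if not indexed.index.is_unique:
    raise ValueError("duplicate (col1, col2) pairs; aggregate them first")

# fill_value=0 writes 0 into the newly created rows, so no fillna is needed
filled = indexed.reindex(mux, fill_value=0).reset_index()
print(filled)
```

If duplicates are possible in the real data, aggregating first (for example with groupby on col1 and col2) before reindexing is one option.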
First, replace the empty strings in the original dataframe with NaN:
import numpy as np
df = df.replace('', np.nan)
Then create a dummy dataframe holding every combination of the unique col1 and col2 values:
dummy = pd.DataFrame([[x, y] for x in df['col1'].dropna().unique() for y in df['col2'].dropna().unique()], columns=['col1', 'col2'])
# You can also try a MultiIndex
# mux = pd.MultiIndex.from_product((df['col1'].dropna().unique(), df['col2'].dropna().unique()), names=['col1','col2'])
# dummy = pd.DataFrame({'col3': [0]*len(mux)}, index=mux).reset_index()
print(dummy)
col1 col2
0 2000.0 b1
1 2000.0 c1
2 2000.0 d1
3 2001.0 b1
4 2001.0 c1
5 2001.0 d1
6 2002.0 b1
7 2002.0 c1
8 2002.0 d1
Finally, fill the NaN values in your original dataframe from the dummy dataframe, then set whatever is still NaN (col3) to 0:
df.update(dummy, overwrite=False)
df.fillna(0, inplace=True)
print(df)
col1 col2 col3
0 2000.0 b1 10.0
1 2000.0 c1 20.0
2 2000.0 d1 30.0
3 2001.0 b1 0.0
4 2001.0 c1 20.0
5 2001.0 d1 40.0
6 2002.0 b1 0.0
7 2002.0 c1 0.0
8 2002.0 d1 60.0
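One thing to keep in mind with this approach: DataFrame.update matches rows by index label, not by content, so it fills the gaps correctly only when each blank row sits at the same index as the combination it should receive, as happens in this example. The small sketch below uses made-up frames (aligned, dummy_small, shifted, unfilled are illustrative names) to show both cases.

```python
import numpy as np
import pandas as pd

# Row 1 is blank and should become (2000, "c1")
aligned = pd.DataFrame({"col1": [2000.0, np.nan, 2001.0],
                        "col2": ["b1", np.nan, "b1"]})

# Dummy frame whose index labels 0, 1, 2 line up with the frame above
dummy_small = pd.DataFrame({"col1": [2000.0, 2000.0, 2001.0],
                            "col2": ["b1", "c1", "b1"]})

# overwrite=False only fills cells that are NaN in the original
aligned.update(dummy_small, overwrite=False)
print(aligned)

# Same values, but with index labels that do not occur in the original
shifted = pd.DataFrame({"col1": [2000.0, 2000.0, 2001.0],
                        "col2": ["b1", "c1", "b1"]},
                       index=[10, 11, 12])
unfilled = pd.DataFrame({"col1": [2000.0, np.nan, 2001.0],
                         "col2": ["b1", np.nan, "b1"]})

# No labels match, so nothing is filled and the blank row stays NaN
unfilled.update(shifted, overwrite=False)
print(unfilled)
```

When the blank rows are not guaranteed to line up like this, a merge or reindex on the (col1, col2) pairs does not depend on row position.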