Consider the following dataframe:
df = pd.read_csv("data.csv")
print(df)
  Category  Year     Month  Count1  Count2
0        a  2017  December       5       9
1        a  2018   January       3       5
2        b  2017   October       7       6
3        b  2017  November       4       1
4        b  2018     March       3       3
I want to achieve this:
   Category  Year     Month  Count1  Count2
0         a  2017   October
1         a  2017  November
2         a  2017  December       5       9
3         a  2018   January       3       5
4         a  2018  February
5         a  2018     March
6         b  2017   October       7       6
7         b  2017  November       4       1
8         b  2017  December
9         b  2018   January
10        b  2018  February
11        b  2018     March       3       3
Here is what I've done so far:
months = {"January": 1, "February": 2, "March": 3, "April": 4, "May": 5, "June": 6, "July": 7, "August": 8, "September": 9, "October": 10, "November": 11, "December": 12}
df["Date"] = pd.to_datetime(10000 * df["Year"] + 100 * df["Month"].apply(months.get) + 1, format="%Y%m%d")
date_min = df["Date"].min()
date_max = df["Date"].max()
new_index = pd.MultiIndex.from_product([df["Category"].unique(), pd.date_range(date_min, date_max, freq="M")], names=["Category", "Date"])
df = df.set_index(["Category", "Date"]).reindex(new_index).reset_index()
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month_name()
df = df[["Category", "Year", "Month", "Count1", "Count2"]]
In the resulting dataframe, the last month (March) is missing and all "Count1" and "Count2" values are NaN.
This is complicated by the fact that you want to fill the category as well as the missing dates. One solution is to create a separate data frame for each category and then concatenate them all together.
# Build a first-of-month timestamp for each row, e.g. "1 December 2017"
df['Date'] = pd.to_datetime('1 '+df.Month.astype(str)+' '+df.Year.astype(str))
# One row per month start between the earliest and latest date
df_ix = pd.Series(1, index=df.Date.sort_values()).resample('MS').first().reset_index()
df_list = []
for cat in df.Category.unique():
    # Right-merge onto the full month range so missing months show up as NaN rows
    df_temp = (df.query('Category==@cat')
                 .merge(df_ix, on='Date', how='right')
                 .get(['Date','Category','Count1','Count2'])
                 .sort_values('Date')
               )
    df_temp['Category'] = cat
    df_temp = df_temp.fillna(0)
    # Plain column assignment so the counts really become integers
    df_temp[['Count1', 'Count2']] = df_temp[['Count1', 'Count2']].astype(int)
    df_list.append(df_temp)
df2 = pd.concat(df_list, ignore_index=True)
df2['Month'] = df2.Date.apply(lambda x: x.strftime('%B'))
df2['Year'] = df2.Date.apply(lambda x: x.year)
df2.drop('Date', axis=1)
# returns:
   Category  Count1  Count2     Month  Year
0         a       0       0   October  2017
1         a       0       0  November  2017
2         a       5       9  December  2017
3         a       3       5   January  2018
4         a       0       0  February  2018
5         a       0       0     March  2018
6         b       7       6   October  2017
7         b       4       1  November  2017
8         b       0       0  December  2017
9         b       0       0   January  2018
10        b       0       0  February  2018
11        b       3       3     March  2018