How to reindex on month and year columns to insert missing data?

Consider the following dataframe:

df = pd.read_csv("data.csv")
print(df)
  Category  Year     Month  Count1  Count2
0        a  2017  December       5       9
1        a  2018   January       3       5
2        b  2017   October       7       6
3        b  2017  November       4       1
4        b  2018     March       3       3

I want to achieve this:

   Category  Year     Month  Count1  Count2
0         a  2017   October               
1         a  2017  November              
2         a  2017  December       5       9
3         a  2018   January       3       5
4         a  2018  February              
5         a  2018     March              
6         b  2017   October       7       6
7         b  2017  November       4       1
8         b  2017  December              
9         b  2018   January              
10        b  2018  February              
11        b  2018     March       3       3

Here is what I've done so far:

months = {"January": 1, "February": 2, "March": 3, "April": 4, "May": 5, "June": 6, "July": 7, "August": 8, "September": 9, "October": 10, "November": 11, "December": 12}
df["Date"] = pd.to_datetime(10000 * df["Year"] + 100 * df["Month"].apply(months.get) + 1, format="%Y%m%d")
date_min = df["Date"].min()
date_max = df["Date"].max()
new_index = pd.MultiIndex.from_product([df["Category"].unique(), pd.date_range(date_min, date_max, freq="M")], names=["Category", "Date"])
df = df.set_index(["Category", "Date"]).reindex(new_index).reset_index()
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month_name()
df = df[["Category", "Year", "Month", "Count1", "Count2"]]

In the resulting dataframe, the last month (March) is missing and all "Count1" and "Count2" values are NaN.

This is complicated by the fact that you need to fill in the category as well as the missing dates. One solution is to build a separate dataframe for each category, merge it against the full range of months, and then concatenate them all together.

import pandas as pd

# Build a first-of-month timestamp from the Month and Year columns
df['Date'] = pd.to_datetime('1 '+df.Month.astype(str)+' '+df.Year.astype(str))

# One row per month start between the earliest and latest date in the data
df_ix = pd.Series(1, index=df.Date.sort_values()).resample('MS').first().reset_index()

df_list = []
for cat in df.Category.unique():
    # Right-merge so that every month in df_ix appears, even without data for this category
    df_temp = (df.query('Category==@cat')
                 .merge(df_ix, on='Date', how='right')
                 .get(['Date','Category','Count1','Count2'])
                 .sort_values('Date')
        )
    df_temp['Category'] = cat
    df_temp = df_temp.fillna(0)
    # The merge introduces NaN, which turns the counts into floats; cast them back to int
    df_temp = df_temp.astype({'Count1': int, 'Count2': int})
    df_list.append(df_temp)

df2 = pd.concat(df_list, ignore_index=True)
df2['Month'] = df2.Date.dt.month_name()
df2['Year'] = df2.Date.dt.year
df2.drop('Date', axis=1)
# returns:
   Category  Count1  Count2     Month  Year
0         a       0       0   October  2017
1         a       0       0  November  2017
2         a       5       9  December  2017
3         a       3       5   January  2018
4         a       0       0  February  2018
5         a       0       0     March  2018
6         b       7       6   October  2017
7         b       4       1  November  2017
8         b       0       0  December  2017
9         b       0       0   January  2018
10        b       0       0  February  2018
11        b       3       3     March  2018
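
As an alternative, the reindex approach from the question can probably be kept with one change. The symptoms described there are consistent with how freq="M" behaves: it generates month-end labels (2017-10-31 through 2018-02-28 for this data), so none of them match the first-of-month dates in the index, and the range stops before March. The following is a minimal sketch using freq="MS" (month start) instead. The sample data is copied from the question, and the names all_months and result are illustrative only.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Category": ["a", "a", "b", "b", "b"],
    "Year": [2017, 2018, 2017, 2017, 2018],
    "Month": ["December", "January", "October", "November", "March"],
    "Count1": [5, 3, 7, 4, 3],
    "Count2": [9, 5, 6, 1, 3],
})

# First-of-month timestamps, parsed from strings such as "1 December 2017"
df["Date"] = pd.to_datetime(
    "1 " + df["Month"] + " " + df["Year"].astype(str), format="%d %B %Y"
)

# "MS" puts every label on the 1st of the month, so the labels line up
# with df["Date"] and the final month is included in the range.
all_months = pd.date_range(df["Date"].min(), df["Date"].max(), freq="MS")
new_index = pd.MultiIndex.from_product(
    [df["Category"].unique(), all_months], names=["Category", "Date"]
)

result = (
    df.set_index(["Category", "Date"])[["Count1", "Count2"]]
      .reindex(new_index, fill_value=0)  # omit fill_value to keep missing months as NaN
      .reset_index()
)

# Rebuild the Year and Month columns from the full date range
result["Year"] = result["Date"].dt.year
result["Month"] = result["Date"].dt.month_name()
result = result[["Category", "Year", "Month", "Count1", "Count2"]]
print(result)
```

This avoids the per-category loop, at the cost of requiring each (Category, Date) pair to be unique, since reindex does not accept duplicate index entries.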
