I have NOAA weather data. In its raw state, it has year and month as rows and days as columns. I want to expand the rows so that each row has a year, month, and day, with the appropriate data in each row.
There is also a weather-variable column (attribute), where each row represents a different weather variable collected that month. The number of weather variables collected can change from month to month: in January there are two (tmax, tmin), in February there are three (tmax, tmin, prcp), and in March there is one (tmax).
Here is an example DataFrame:
import numpy as np
import pandas as pd
example_df = pd.DataFrame({'station': ['USC1', 'USC1', 'USC1', 'USC1', 'USC1', 'USC1'],
'year': [1993, 1993, 1993, 1993,1993, 1993],
'month': [1, 1, 2, 2, 2, 3],
'attribute':['tmax', 'tmin', 'tmax', 'tmin', 'prcp', 'tmax'],
'day1': range(1, 7, 1),
'day2': range(1, 7, 1),
'day3': range(1, 7, 1),
'day4': range(1, 7, 1),
})
example_df = example_df[['station', 'year', 'month', 'attribute', 'day1', 'day2', 'day3', 'day4']]
This is the output I want:
solution_df = pd.DataFrame({'station': ['USC1', 'USC1', 'USC1', 'USC1', 'USC1', 'USC1','USC1', 'USC1', 'USC1', 'USC1', 'USC1', 'USC1'],
'year': [1993, 1993, 1993, 1993,1993, 1993, 1993, 1993, 1993, 1993,1993, 1993],
'month': [1, 1,1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'day':[1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
'tmax': [1, 1, 1, 1, 3, 3, 3, 3, 6, 6, 6, 6],
'tmin': [2, 2, 2, 2, 4, 4, 4, 4, np.nan, np.nan, np.nan, np.nan],
'prcp': [np.nan, np.nan, np.nan, np.nan, 5, 5, 5, 5, np.nan, np.nan, np.nan, np.nan]
})
solution_df = solution_df[['station', 'year', 'month', 'day', 'tmax', 'tmin', 'prcp']]
I have tried .T, pivot, melt, stack, and unstack to get the day columns to be rows with the correct months.
This is as close as I have gotten with the example dataset:
record_arr = example_df.to_records()
new_df = pd.DataFrame({'station': np.nan,
'year': np.nan,
'month':np.nan,
'day': np.nan,
'tmax':np.nan,
'tmin': np.nan,
'prcp':np.nan},
index = [1]
)
new_df.append ({'station': record_arr[0][1], 'year': record_arr[0][2], 'month':record_arr[0][3], 'tmax':record_arr[0][5], 'tmin':record_arr[1][5] }, ignore_index = True)
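This requires a pivot as well as a melt (or, equivalently, an unstack and a stack). Here is how I did it in two steps:
# Step 1: stack the day columns into rows (the new index level is named 'level_4', the values land in column 0)
df1 = example_df.set_index(['station', 'year', 'month', 'attribute']).stack().reset_index()
# Step 2: move 'attribute' from the rows to the columns
df1.set_index(['station', 'year', 'month', 'level_4','attribute'])[0].unstack().reset_index()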
attribute  station  year  month  level_4  prcp  tmax  tmin
0             USC1  1993      1     day1   NaN   1.0   2.0
1             USC1  1993      1     day2   NaN   1.0   2.0
2             USC1  1993      1     day3   NaN   1.0   2.0
3             USC1  1993      1     day4   NaN   1.0   2.0
4             USC1  1993      2     day1   5.0   3.0   4.0
5             USC1  1993      2     day2   5.0   3.0   4.0
6             USC1  1993      2     day3   5.0   3.0   4.0
7             USC1  1993      2     day4   5.0   3.0   4.0
8             USC1  1993      3     day1   NaN   6.0   NaN
9             USC1  1993      3     day2   NaN   6.0   NaN
10            USC1  1993      3     day3   NaN   6.0   NaN
11            USC1  1993      3     day4   NaN   6.0   NaN
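The output above still has the day as a string label in level_4 and the columns in alphabetical order. To get closer to the layout asked for in the question, the same reshape can be sketched with melt followed by pivot_table. The names id_cols, long_df, value, and tidy_df are made up for this illustration, and the sketch assumes each station/year/month/attribute combination appears only once (pivot_table would otherwise silently average duplicates).

```python
import pandas as pd

example_df = pd.DataFrame({
    'station': ['USC1'] * 6,
    'year': [1993] * 6,
    'month': [1, 1, 2, 2, 2, 3],
    'attribute': ['tmax', 'tmin', 'tmax', 'tmin', 'prcp', 'tmax'],
    'day1': range(1, 7),
    'day2': range(1, 7),
    'day3': range(1, 7),
    'day4': range(1, 7),
})

id_cols = ['station', 'year', 'month']  # hypothetical helper name

# Wide to long: one row per (station, year, month, attribute, day).
long_df = example_df.melt(id_vars=id_cols + ['attribute'],
                          var_name='day', value_name='value')

# Turn the labels 'day1', 'day2', ... into the integers 1, 2, ...
long_df['day'] = long_df['day'].str.replace('day', '', regex=False).astype(int)

# Long to wide: one column per weather variable.
# Assumption: no duplicate keys, so the default mean aggregation is a no-op.
tidy_df = (long_df
           .pivot_table(index=id_cols + ['day'], columns='attribute', values='value')
           .reset_index()
           .rename_axis(columns=None))

# Reorder the columns to match the layout requested in the question.
tidy_df = tidy_df[id_cols + ['day', 'tmax', 'tmin', 'prcp']]
print(tidy_df)
```

The values come back as floats because the missing variables are filled with NaN. With real NOAA files that carry more day columns, the same melt call should apply unchanged, since every column not listed in id_vars is treated as a day column.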