简体   繁体   中英

pandas merge dataframes with multi-index

I have a bunch of multi-index pd.DataFrames containing statistics for weather observations taken at a set of ground stations on certain days. Here's the structure of one such data frame:

>>> df = pd.DataFrame({'month': [1, 1, 1, 1, 1, 1],
                   'day': [2]*6,
                   'station': ['A', 'B', 'C', 'D', 'E', 'F'],
                   'mean': [55, 40, 84, 31, 44, 12],
                   'sd': [1., 2., 1.2, 3., 4., 0.7]})
>>> df.set_index(['station', 'month', 'day'])
>>> df
>>> df = df.set_index(['station', 'month', 'day'])
>>> df
                   mean   sd
station month day           
A       1     2      55  1.0
B       1     2      40  2.0
C       1     2      84  1.2
D       1     2      31  3.0
E       1     2      44  4.0
F       1     2      12  0.7

df stores all the observations for day 2 of month 1 (say, Jan 2). The station index is unique for each dataframe (no two station IDs are the same). However, the set of stations may vary between individual dataframes, depending on the month and/or day.

Question : how do I combine these dataframes into a single dataframe with the following structure (ignore the exact mean and sd values, i've made those up):

>>> df
                   mean   sd
station month day           
A       1     1      55  1.0
              2      44  5.0
              3      34  1.2
(...)
        2     1      55  1.0
              2      44  5.0
              3      34  1.2
(...)   
B       1     1      31  3.0
              2      44  5.0
              3      34  1.2
(...)

You can use pd.concat:

In [15]: df1 = pd.DataFrame({'month': [1, 1, 1, 1, 1, 1],
    ...:                    'day': [2]*6,
    ...:                    'station': ['A', 'B', 'C', 'D', 'E', 'F'],
    ...:                    'mean': [55, 40, 84, 31, 44, 12],
    ...:                    'sd': [1., 2., 1.2, 3., 4., 0.7]}).set_index(["station", "month", "day"])

In [16]: df2 = pd.DataFrame({'month': [2, 2, 2, 2, 2, 2],
    ...:                    'day': [2]*6,
    ...:                    'station': ['A', 'B', 'C', 'D', 'G', 'F'],
    ...:                    'mean': [55, 40, 84, 31, 15, 12],
    ...:                    'sd': [1., 2., 1.2, 3.,1, 0.7]}).set_index(["station", "month", "day"])

In [19]: pd.concat([df1,df2]).sort_index()
Out[19]:
                   mean   sd
station month day
A       1     2      55  1.0
        2     2      55  1.0
B       1     2      40  2.0
        2     2      40  2.0
C       1     2      84  1.2
        2     2      84  1.2
D       1     2      31  3.0
        2     2      31  3.0
E       1     2      44  4.0
F       1     2      12  0.7
        2     2      12  0.7
G       2     2      15  1.0

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM