简体   繁体   中英

Pandas - join two DataFrames by multiple index and columns

I have two Pandas DataFrame objects that need to be joined by multiple index and columns.

DF1 with daily data (Indices are RNK, R_ID, latitude and longitude):

                                Date        FFDI
RNK R_ID latitude   longitude               
1   0   -39.20000   140.80000   1973-04-02  5.40000
    1   -39.20000   140.83786   1973-04-02  5.40000
    2   -39.20000   140.87572   1973-04-02  5.40000
    3   -39.20000   140.91359   1973-04-02  5.40000
    4   -39.20000   140.95145   1973-04-02  5.40000
    5   -39.20000   140.98930   1973-04-02  5.40000
    6   -39.20000   141.02716   1973-04-02  5.40000
    7   -39.20000   141.06502   1973-05-31  5.40000
    8   -39.20000   141.10289   1973-05-31  5.50000
    9   -39.20000   141.14075   1973-05-31  6.00000
    10  -39.20000   141.17860   1973-05-31  6.40000
    11  -39.20000   141.21646   1973-05-31  6.80000
    12  -39.20000   141.25432   1973-05-31  7.70000
    13  -39.20000   141.29219   1973-05-31  7.90000
    14  -39.20000   141.33005   1973-05-31  7.00000
    15  -39.20000   141.36790   1973-05-31  6.60000
    16  -39.20000   141.40576   1973-05-31  6.10000
    17  -39.20000   141.44362   1973-05-31  5.00000
    18  -39.20000   141.48149   1973-05-31  4.40000
    19  -39.20000   141.51935   1972-04-21  4.40000
    20  -39.20000   141.55721   1972-04-21  4.40000
    21  -39.20000   141.59506   1972-04-21  4.50000
    22  -39.20000   141.63292   1972-04-21  4.60000
    23  -39.20000   141.67079   1972-04-21  4.70000
    24  -39.20000   141.70865   1972-04-21  4.70000
    25  -39.20000   141.74651   1972-04-21  4.70000
    26  -39.20000   141.78436   1972-04-21  4.70000
    27  -39.20000   141.82222   1972-04-21  4.70000
    28  -39.20000   141.86009   1972-04-21  4.70000
    29  -39.20000   141.89795   1972-04-21  4.70000
... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ...
... ... ... ... ... ... ... ... ... ... ... ...
5   36082   -33.90000   148.90205   1972-12-24  35.70000
    36083   -33.90000   148.93991   1974-11-12  36.30000
    36084   -33.90000   148.97778   1974-11-12  35.90000
    36085   -33.90000   149.01564   1973-11-20  36.80000
    36086   -33.90000   149.05350   1973-11-20  37.00000
    36087   -33.90000   149.09135   1974-11-12  35.60000
    36088   -33.90000   149.12921   1973-01-03  35.90000
    36089   -33.90000   149.16708   1973-01-03  34.40000
    36090   -33.90000   149.20494   1973-01-03  32.90000
    36091   -33.90000   149.24280   1973-01-03  32.20000
    36092   -33.90000   149.28065   1973-01-03  32.30000
    36093   -33.90000   149.31851   1973-01-03  32.20000
    36094   -33.90000   149.35638   1973-01-03  30.20000
    36095   -33.90000   149.39424   1973-11-20  28.60000
    36096   -33.90000   149.43210   1973-11-20  28.70000
    36097   -33.90000   149.46996   1973-11-20  29.10000
    36098   -33.90000   149.50781   1973-11-20  30.10000
    36099   -33.90000   149.54568   1973-11-20  30.80000
    36100   -33.90000   149.58354   1973-01-09  30.60000
    36101   -33.90000   149.62140   1973-01-09  30.10000
    36102   -33.90000   149.65926   1973-01-09  29.50000
    36103   -33.90000   149.69711   1973-01-09  29.20000
    36104   -33.90000   149.73499   1973-01-09  29.90000
    36105   -33.90000   149.77284   1973-01-09  29.90000
    36106   -33.90000   149.81070   1973-01-09  27.60000
    36107   -33.90000   149.84856   1973-01-09  24.40000
    36108   -33.90000   149.88641   1973-01-09  23.80000
    36109   -33.90000   149.92429   1973-01-09  23.80000
    36110   -33.90000   149.96214   1973-01-09  24.10000
    36111   -33.90000   150.00000   1973-01-09  25.30000

DF2 with hourly data (Index = R_ID):

     latitude   longitude    time                T_SFC
R_ID                
0   -39.20000   140.80000   1972-01-20 00:00:00 15.80000
0   -39.20000   140.80000   1972-01-20 01:00:00 15.90000
0   -39.20000   140.80000   1972-01-20 02:00:00 16.00000
0   -39.20000   140.80000   1972-01-20 03:00:00 16.20000
0   -39.20000   140.80000   1972-01-20 04:00:00 16.60000
0   -39.20000   140.80000   1972-01-20 05:00:00 16.60000
0   -39.20000   140.80000   1972-01-20 06:00:00 16.50000
0   -39.20000   140.80000   1972-01-20 07:00:00 16.50000
0   -39.20000   140.80000   1972-01-20 08:00:00 16.50000
0   -39.20000   140.80000   1972-01-20 09:00:00 16.40000
0   -39.20000   140.80000   1972-01-20 10:00:00 16.40000
0   -39.20000   140.80000   1972-01-20 11:00:00 16.40000
0   -39.20000   140.80000   1972-01-20 12:00:00 16.50000
0   -39.20000   140.80000   1972-01-20 13:00:00 16.60000
0   -39.20000   140.80000   1972-01-20 14:00:00 16.60000
0   -39.20000   140.80000   1972-01-20 15:00:00 16.70000
0   -39.20000   140.80000   1972-01-20 16:00:00 16.70000
0   -39.20000   140.80000   1972-01-20 17:00:00 16.60000
0   -39.20000   140.80000   1972-01-20 18:00:00 16.60000
0   -39.20000   140.80000   1972-01-20 19:00:00 16.60000
0   -39.20000   140.80000   1972-01-20 20:00:00 16.50000
0   -39.20000   140.80000   1972-01-20 21:00:00 16.50000
0   -39.20000   140.80000   1972-01-20 22:00:00 16.50000
0   -39.20000   140.80000   1972-01-20 23:00:00 16.40000
0   -39.20000   140.80000   1972-01-21 00:00:00 16.40000
0   -39.20000   140.80000   1972-01-21 01:00:00 16.30000
0   -39.20000   140.80000   1972-01-21 02:00:00 16.30000
0   -39.20000   140.80000   1972-01-21 03:00:00 16.30000
0   -39.20000   140.80000   1972-01-21 04:00:00 16.10000
0   -39.20000   140.80000   1972-01-21 05:00:00 16.00000
... ... ... ... ...
36111   -38.87551   141.14075   1974-12-30 18:00:00 14.10000
36111   -38.87551   141.14075   1974-12-30 19:00:00 14.10000
36111   -38.87551   141.14075   1974-12-30 20:00:00 14.10000
36111   -38.87551   141.14075   1974-12-30 21:00:00 14.10000
36111   -38.87551   141.14075   1974-12-30 22:00:00 14.20000
36111   -38.87551   141.14075   1974-12-30 23:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 00:00:00 14.40000
36111   -38.87551   141.14075   1974-12-31 01:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 02:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 03:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 04:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 05:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 06:00:00 14.60000
36111   -38.87551   141.14075   1974-12-31 07:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 08:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 09:00:00 14.40000
36111   -38.87551   141.14075   1974-12-31 10:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 11:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 12:00:00 14.40000
36111   -38.87551   141.14075   1974-12-31 13:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 14:00:00 14.40000
36111   -38.87551   141.14075   1974-12-31 15:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 16:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 17:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 18:00:00 14.30000
36111   -38.87551   141.14075   1974-12-31 19:00:00 14.40000
36111   -38.87551   141.14075   1974-12-31 20:00:00 14.50000
36111   -38.87551   141.14075   1974-12-31 21:00:00 14.60000
36111   -38.87551   141.14075   1974-12-31 22:00:00 14.70000
36111   -38.87551   141.14075   1974-12-31 23:00:00 14.80000

DF1 has a Date column with daily values from 1972-01-20 to 1974-12-31 while DF2 has a Time column with hourly values from 1972-01-20T00:00:00 to 1974-12-31T23:00:00. DF1 is sorted by RNK (Rank) and FFDI while DF2 is sorted by R_ID and time. One R_ID is a reference ID that corresponds to one unique pair of latitude and longitude. DF2 will be joined to DF1 with the same R_ID and the same Date that DF2 's time column belongs to. That is each row (day) in DF1 will have 24 (hours) rows from DF2 with the same value of day.

The output df will look like:

                                                              time                   T_SFC
RNK  R_ID       latitude    longitude   Date        FFDI
1    0          -39.20000   140.80000   1973-04-02  5.40000   1973-04-02 00:00:00    13.8
                                                              1973-04-02 01:00:00    13.9
                                                              1973-04-02 02:00:00    13.0
                                                              1973-04-02 03:00:00    13.2
                                                              1973-04-02 04:00:00    13.6
                                                              ... ... ... ...
     1          -39.20000   140.83786   1973-04-02  5.40000   1973-04-02 00:00:00    13.8
                                                              1973-04-02 01:00:00    13.9
                                                              1973-04-02 02:00:00    13.0
                                                              1973-04-02 03:00:00    13.2
                                                              1973-04-02 04:00:00    13.6
                                                              ... ... ... ...
     2          -39.20000   140.87572   1973-04-02  5.40000   1973-04-02 00:00:00    13.8
                                                              1973-04-02 01:00:00    13.9
                                                              1973-04-02 02:00:00    13.0
                                                              1973-04-02 03:00:00    13.2
                                                              1973-04-02 04:00:00    13.6
                                                              ... ... ... ...
     ... ... ... ...
2    0          -39.20000   140.80000   1974-03-07  5.60000   1974-03-07 00:00:00    15.8
                                                              1974-03-07 01:00:00    15.9
                                                              1974-03-07 02:00:00    16.0
                                                              1974-03-07 03:00:00    16.2
                                                              1974-03-07 04:00:00    16.6
                                                              ... ... ... ...
     1          -39.20000   140.83786   1973-03-09  5.40000   1973-03-09 00:00:00    15.8
                                                              1973-03-09 01:00:00    15.9
                                                              1973-03-09 02:00:00    16.0
                                                              1973-03-09 03:00:00    15.2
                                                              1973-03-09 04:00:00    15.6
                                                              ... ... ... ...
... ... ... ...
... ... ... ...
5    36082     -33.90000    148.90205   1972-12-24  35.70000  1972-12-24 00:00:00    19.8
                                                              1972-12-24 01:00:00    19.1
                                                              1972-12-24 02:00:00    22.0
                                                              1972-12-24 03:00:00    24.2
                                                              1972-12-24 04:00:00    21.6
                                                              ... ... ... ...
     ... ... ... ...
     36111     -33.90000    150.00000   1973-01-09  25.30000  1973-01-09 00:00:00    19.8
                                                              1973-01-09 01:00:00    19.1
                                                              1973-01-09 02:00:00    22.0
                                                              1973-01-09 03:00:00    24.2
                                                              1973-01-09 04:00:00    21.6
                                                              ... ... ... ...
                                                              1973-01-09 23:00:00    19.1
4,333,440 rows x 2 columns

Following @politinsa answer, I tried

# Add a new column Date and save date part of the time column to it.
df2['Date'] = df2['time'].dt.date.astype('datetime64[ns]')

df_joined = pd.merge(df1, df2, on=['REF_ID', 'Date'], how='inner')

The issue with the output is the multiindex from df1 was not kept with RNK missing from the output df.

print(df_joined)

        time    FFDI         latitude   longitude   T_SFC       time_original
REF_ID                      
0   1973-04-02  5.40000     -39.20000   140.80000   16.40000    1973-04-02 00:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   16.00000    1973-04-02 01:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.70000    1973-04-02 02:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.40000    1973-04-02 03:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.20000    1973-04-02 04:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.10000    1973-04-02 05:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.10000    1973-04-02 06:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.10000    1973-04-02 07:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.10000    1973-04-02 08:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.10000    1973-04-02 09:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.10000    1973-04-02 10:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.20000    1973-04-02 11:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.20000    1973-04-02 12:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.20000    1973-04-02 13:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.00000    1973-04-02 14:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.10000    1973-04-02 15:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.30000    1973-04-02 16:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.40000    1973-04-02 17:00:00
0   1973-04-02  5.40000     -39.20000   140.80000   15.40000    1973-04-02 18:00:00

... ... ... ...
12000 rows × 6 columns

You could create a column in DF2 containing the date (instead of the date time ), ie at row 1973-04-02 01:00:00 you'd have a column Date containing 1973-04-02 .

Then use a classical inner join ( pd.merge(df1, df2, on=['R_ID', 'Date'], how='inner') ) and it should do the trick.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM