Overwrite one dataframe with values from another dataframe, based on repeated datetime index

I want to update and overwrite the values of one dataframe with the values in another, matching on a datetime index that contains repeated timestamps. The code below illustrates my problem; I have given df1 deliberately extreme values so they are easy to spot:

# import packages
import pandas as pd
import numpy as np

# create dataframes and indices
df = pd.DataFrame(np.random.randint(0, 30, size=(10, 3)), columns=['MeanT', 'MaxT', 'MinT'])
df1 = pd.DataFrame(np.random.randint(900, 1000, size=(10, 3)), columns=['MeanT', 'MaxT', 'MinT'])

df['Location'] = [2, 2, 3, 3, 4, 4, 5, 5, 6, 6]
df1['Location'] = [2, 2, 3, 3, 4, 4, 5, 5, 6, 6]

# each location has two consecutive daily timestamps; df1 starts one day later than df
df.index = pd.to_datetime(["2020-05-18 12:00:00", "2020-05-19 12:00:00"] * 5)
df1.index = pd.to_datetime(["2020-05-19 12:00:00", "2020-05-20 12:00:00"] * 5)

Here are both dataframes: df covers the 18th and 19th, and df1 covers the 19th and 20th.

print(df)
                     MeanT  MaxT  MinT  Location
2020-05-18 12:00:00     28     0     9         2
2020-05-19 12:00:00     22     7    11         2
2020-05-18 12:00:00      2     7     7         3
2020-05-19 12:00:00     10    24    18         3
2020-05-18 12:00:00     10    12    25         4
2020-05-19 12:00:00     25     7    20         4
2020-05-18 12:00:00      1     8    11         5
2020-05-19 12:00:00     27    19    12         5
2020-05-18 12:00:00     25    10    26         6
2020-05-19 12:00:00     29    11    27         6

print(df1)
                     MeanT  MaxT  MinT  Location
2020-05-19 12:00:00    912   991   915         2
2020-05-20 12:00:00    936   917   965         2
2020-05-19 12:00:00    918   977   901         3
2020-05-20 12:00:00    974   971   927         3
2020-05-19 12:00:00    979   929   953         4
2020-05-20 12:00:00    988   955   939         4
2020-05-19 12:00:00    969   983   940         5
2020-05-20 12:00:00    902   904   916         5
2020-05-19 12:00:00    983   942   965         6
2020-05-20 12:00:00    928   994   933         6

I want to create a new dataframe which updates df with the values from df1, so the new df has values for the 18th from df, and the 19th and 20th from df1.

I have tried using combine_first like so:

df = df.set_index(df.groupby(level=0).cumcount(), append=True)
df1 = df1.set_index(df1.groupby(level=0).cumcount(), append=True)
 
df3 = df.combine_first(df1).sort_index(level=[1,0]).reset_index(level=1, drop=True)

which updates the dataframe, but doesn't overwrite the data for the 19th with values in df1. It produces this output:

print(df3)
                     MeanT   MaxT   MinT  Location
2020-05-18 12:00:00   28.0    0.0    9.0       2.0
2020-05-19 12:00:00   22.0    7.0   11.0       2.0
2020-05-20 12:00:00  936.0  917.0  965.0       2.0
2020-05-18 12:00:00    2.0    7.0    7.0       3.0
2020-05-19 12:00:00   10.0   24.0   18.0       3.0
2020-05-20 12:00:00  974.0  971.0  927.0       3.0
2020-05-18 12:00:00   10.0   12.0   25.0       4.0
2020-05-19 12:00:00   25.0    7.0   20.0       4.0
2020-05-20 12:00:00  988.0  955.0  939.0       4.0
2020-05-18 12:00:00    1.0    8.0   11.0       5.0
2020-05-19 12:00:00   27.0   19.0   12.0       5.0
2020-05-20 12:00:00  902.0  904.0  916.0       5.0
2020-05-18 12:00:00   25.0   10.0   26.0       6.0
2020-05-19 12:00:00   29.0   11.0   27.0       6.0
2020-05-20 12:00:00  928.0  994.0  933.0       6.0

So the values for the 18th and the 20th are correct, but the values for the 19th are still from df. I want the values from df to be overwritten with those in df1. Please help!

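You just need to apply combine_first in the opposite direction: call it on df1 and pass df as the argument, because the calling dataframe's non-null values take priority and the argument only fills the gaps. You can also use 'Location' as an extra index level instead of groupby.cumcount: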

df3 = (df1.set_index('Location', append=True)
          .combine_first(df.set_index('Location', append=True))
          .reset_index(level='Location')
          .reindex(columns=df.columns)
          .sort_values('Location'))

print(df3)

                     MeanT   MaxT   MinT  Location
2020-05-18 12:00:00   28.0    0.0    9.0         2
2020-05-19 12:00:00  912.0  991.0  915.0         2
2020-05-20 12:00:00  936.0  917.0  965.0         2
2020-05-18 12:00:00    2.0    7.0    7.0         3
2020-05-19 12:00:00  918.0  977.0  901.0         3
2020-05-20 12:00:00  974.0  971.0  927.0         3
2020-05-18 12:00:00   10.0   12.0   25.0         4
2020-05-19 12:00:00  979.0  929.0  953.0         4
2020-05-20 12:00:00  988.0  955.0  939.0         4
2020-05-18 12:00:00    1.0    8.0   11.0         5
2020-05-19 12:00:00  969.0  983.0  940.0         5
2020-05-20 12:00:00  902.0  904.0  916.0         5
2020-05-18 12:00:00   25.0   10.0   26.0         6
2020-05-19 12:00:00  983.0  942.0  965.0         6
2020-05-20 12:00:00  928.0  994.0  933.0         6
