简体   繁体   中英

Python/Pandas merge issue with NaN data

I am trying to use pd.concat to merge two data frames ( df and df2 ) into one new dataframe ( df3 ) in Pandas using the following code:

df3 = pd.concat([df, df2])

This almost works the way I want it to, but it creates an issue.

df contains data for the current date, and the index is a time series. It looks like this:

                        Facility    Servers   PUE
2016-10-31  00:00:00    6.0         5.0       1.2
2016-10-31  00:30:00    7.0         5.0       1.4
2016-10-31  01:00:00    6.0         5.0       1.2
2016-10-31  01:30:00    6.0         5.0       1.2
2016-10-31  02:00:00    6.0         5.0       1.2

df2 contains NaN data only and the index is a time series that corresponds in format to the one in df , but beginning on an earlier date and continuing for a full year (ie 17520 rows to correspond to 365 * 48 thirty minute intervals). It looks basically like this:

                        Facility    Servers   PUE
2016-10-01  00:00:00    NaN         NaN       NaN
2016-10-01  00:30:00    NaN         NaN       NaN
2016-10-01  01:00:00    NaN         NaN       NaN
2016-10-01  01:30:00    NaN         NaN       NaN
2016-10-01  02:00:00    NaN         NaN       NaN
2016-10-01  02:30:00    NaN         NaN       NaN
<continues to 17520 rows, i.e. one year of 30 minute time intervals>

When I apply: df3 = pd.concat([df, df2])

and then run df3.head() , I get the following:

                        Facility    Servers   PUE
2016-10-31  00:00:00    6.0         5.0       1.2
2016-10-31  00:30:00    7.0         5.0       1.4
2016-10-31  01:00:00    6.0         5.0       1.2
2016-10-31  01:30:00    6.0         5.0       1.2
2016-10-31  02:00:00    6.0         5.0       1.2
2016-10-31  02:30:00    NaN         NaN       NaN
2016-10-31  03:00:00    NaN         NaN       NaN
2016-10-31  03:30:00    NaN         NaN       NaN
<continues to the end of the year>

In other words, the code seems to remove all of the NaN data for the time intervals that occurred before the data in df . Can anyone advise on how to preserve all of the data in df2 , replacing it with data only for the corresponding time intervals from df ?

I think you need reindex by union of both indexes :

print (df2.index.union(df.index))
DatetimeIndex(['2016-10-01 00:00:00', '2016-10-01 00:30:00',
               '2016-10-01 01:00:00', '2016-10-01 01:30:00',
               '2016-10-01 02:00:00', '2016-10-01 02:30:00',
               '2016-10-31 00:00:00', '2016-10-31 00:30:00',
               '2016-10-31 01:00:00', '2016-10-31 01:30:00',
               '2016-10-31 02:00:00'],
              dtype='datetime64[ns]', freq=None)

df = df.reindex(df2.index.union(df.index))
print (df)
                     Facility  Servers  PUE
2016-10-01 00:00:00       NaN      NaN  NaN
2016-10-01 00:30:00       NaN      NaN  NaN
2016-10-01 01:00:00       NaN      NaN  NaN
2016-10-01 01:30:00       NaN      NaN  NaN
2016-10-01 02:00:00       NaN      NaN  NaN
2016-10-01 02:30:00       NaN      NaN  NaN
2016-10-31 00:00:00       6.0      5.0  1.2
2016-10-31 00:30:00       7.0      5.0  1.4
2016-10-31 01:00:00       6.0      5.0  1.2
2016-10-31 01:30:00       6.0      5.0  1.2
2016-10-31 02:00:00       6.0      5.0  1.2

use combine_first

result = df1.combine_first(df2)

result will only takes values from the right DataFrame if they are missing in the left DataFrame

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM