简体   繁体   中英

Pandas Merging/Join Dataframes

I am not sure if this means merging or joining.

I have a dataframe with following columns ['times','spots'] , DataFrame A and I have another dataframe with similar columns ['times','spots'] , DataFrame B.

I want to change dataframe A so that it has values from dataframe B interpolated at those times from A. So it would have a new column spots_B in dataframe A.

Well, I'm going to go out on a ledge and show you how to merge with suffixes:

# First import our libraries
>>> import pandas as pd
>>> import numpy as np
# Then create our dataframes
>>> df_A = pd.DataFrame(np.random.rand(3,2),columns=['times','spots'])
>>> df_B = pd.DataFrame(np.random.rand(3,2),columns=['times','spots'])
# Set default values
>>> df_A['times'] = [1,2,3]
>>> df_B['times'] = [1,2,3]
>>> df_A['spots'] = [44,55,66]
>>> df_B['spots'] = [77,88,99]
# Here is what both dataframes contain
>>> df_A
   times  spots
0      1     44
1      2     55
2      3     66
>>> df_B
   times  spots
0      1     77
1      2     88
2      3     99
# Now the merge -- note: this does not affect the first dataframe in place.
#                        It will create a new dataframe. You can overwrite the 
#                        first if you set the result to df_A instead of df_merged.
# Note the use of the keyword, suffixes. In the event that the same column names exist
#  in both dataframes (that aren't being merged on) Pandas will need to differentiate
#  between them. By default same column names will result in a '_x' will be appended to
#  the left dataframe column name, and a '_y' to the right dataframe column name
#  [order is set by the first two arguments in the merge function]. 
#  The suffixes keyword allows the user to override this behaviour with their
#  own version of '_x' and '_y'.
>>> df_merged = pd.merge(df_A,df_B,how='inner',on=['times'],suffixes=['_A','_B'])
>>> df_merged
   times  spots_A  spots_B
0      1       44       77
1      2       55       88
2      3       66       99

Now it seems from your question that you didn't want to modify the first dataframes column name for spots. This can be achieved in the same fashion, except instead of using suffixes=['_A','_B'] use suffixes=['','_B'] . This essentially sets the left dataframe column suffix to nothing, so it stays the same:

>>> df_merged = pd.merge(df_A,df_B,how='inner',on=['times'],suffixes=['','_B'])
>>> df_merged
   times  spots  spots_B
0      1     44       77
1      2     55       88
2      3     66       99

Voila! I hope this helped. If I misunderstood, and you were actually looking for the interpolated values between A and B, let me know and I will edit this answer.

* Edit 1 *

Taking into consideration your last comment, here is what I believe you are trying to accomplish. Below I will show you how to extend the merging with suffixes and then fill NaNs in spots_B with interpolated values using the 'time' interpolation method

# Start by creating out datetimes to set for the times column
>>> times_A = []
>>> times_B = []
>>> for i in range(1,4):
...   times_A.append(datetime.datetime(year=2011,month=5,day=i))
...
>>> for i in range(1,6,2):
...   times_B.append(datetime.datetime(year=2011,month=5,day=i))
...
# times_A: May 1st, 2011 - May 3rd, 2011
>>> times_A
[datetime.datetime(2011, 5, 1, 0, 0), datetime.datetime(2011, 5, 2, 0, 0), datetime.datetime(2011, 5, 3, 0, 0)]
# times_B: May 1st 2011, May 3rd 2011, May 5th 2011
>>> times_B
[datetime.datetime(2011, 5, 1, 0, 0), datetime.datetime(2011, 5, 3, 0, 0), datetime.datetime(2011, 5, 5, 0, 0)]
# So now times_B is missing May 2nd, and has an extra time, May 5th.
>>> df_A['times'] = times_A
>>> df_B['times'] = times_B
>>> df_A['spots'] = [44,55,66]
>>> df_B['spots'] = [44,66,88]
>>> df_A
                times  spots
0 2011-05-01 00:00:00     44
1 2011-05-02 00:00:00     55
2 2011-05-03 00:00:00     66
>>> df_B
                times  spots
0 2011-05-01 00:00:00     44
1 2011-05-03 00:00:00     66
2 2011-05-05 00:00:00     88

# Now it appears you only care about the times in df_A - so
#   left merge df_A with df_B (include all times from df_A and  
#   try to merge with df_B or NaN). Below the date May 5th was dropped.
>>> df_merged = pd.merge(df_A,df_B,how='left',on=['times'],suffixes=['','_B'])
>>> df_merged
                times  spots  spots_B
0 2011-05-01 00:00:00     44       44
1 2011-05-02 00:00:00     55      NaN
2 2011-05-03 00:00:00     66       66

# Here is the important part:
# Since it appears that your data is going to be a time series
#   you will need to set your dataframe index to be the times column.
>>> df_merged = df_merged.set_index(['times'])
>>> df_merged
            spots  spots_B
times
2011-05-01     44       44
2011-05-02     55      NaN
2011-05-03     66       66

# With the times as index we can use the appropriate
#   interpolation method for best results
>>> df_merged['spots_B'] = df_merged['spots_B'].interpolate(method='time')
>>> df_merged
            spots  spots_B
times
2011-05-01     44       44
2011-05-02     55       55
2011-05-03     66       66

Note: The default behavior of interpolate() on a Series is to assume each row is equi-distant. If your times are not going to be equally spaced apart, you will need to reindex your dataframe with a TimeSeries index. When the index is a timeseries then you can use the method='time' argument in the interpolate() function.

Note2: Please try to provide as much detail when asking a question. It helps the person who wants to answer to fully understand your issue.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM