
Merging columns from multiple dataframes by specified column

I'm working on time series and I have 10 different stock prices in CSV files. What I'm trying to do is simply put their Close prices into one dataframe and name each column after its stock.

I did it manually, but there should be a better way. The files also contain all the other columns. I need the rows to be matched by Date. If one stock is missing a date that another one has, it should get NaN values so I can drop them easily.

Here's what I did so far:

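import numpy as np
import pandas as pd

sym1 = "AAPL"
sym2 = "AMZN"
s1 = "./stocks/{}.csv".format(sym1)
s2 = "./stocks/{}.csv".format(sym2)

# Load the first stock and index it by date
df = pd.read_csv(s1)
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)

# Load the second stock, index it by date too, then align it to the first
ff = pd.read_csv(s2)
ff['Date'] = pd.to_datetime(ff['Date'])
ff.set_index('Date', inplace=True)
ff = ff.reindex(df.index)  # dates missing from ff become NaN

ff[sym1] = df['Close']
ff[sym2] = ff['Close']
print(ff[[sym1, sym2]].tail())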

As long as you have both datasets stored as dataframes with a common index (of the same datatype), you can use pd.merge() like so:

df2 = pd.merge(df, ff, how='left',left_index = True, right_index = True)

Which values end up missing in the final dataframe depends on your data and on the join type, which is set by how='left': every row of the left dataframe is kept, and dates that exist only in the right dataframe are dropped. The example below builds four random series, concatenates them two by two into two dataframes, and then merges those into a single dataframe with some missing values.
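As a smaller illustration of the how argument alone, here is a minimal sketch on two toy frames whose date ranges only partly overlap. The names left, right and joins, and all the values, are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy data: the two date ranges overlap on Jan 3 and Jan 4 only
left = pd.DataFrame({'AAPL': [1.0, 2.0, 3.0, 4.0]},
                    index=pd.date_range('2000-01-01', periods=4))
right = pd.DataFrame({'MSFT': [10.0, 20.0, 30.0, 40.0]},
                     index=pd.date_range('2000-01-03', periods=4))

# Run the same index-based merge with three different join types
joins = {}
for how in ['left', 'inner', 'outer']:
    joins[how] = pd.merge(left, right, how=how,
                          left_index=True, right_index=True)
    print(how, joins[how].shape)
```

With this data, 'left' keeps the four dates of left and fills MSFT with NaN where right has no row, 'inner' keeps only the two shared dates, and 'outer' keeps all six dates from both frames.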

Using left_index=True, right_index=True specifies that the dataframes are merged on their date index. I'd prefer to do it that way, since your example suggests that you'd like to use date indexes. If you, as the title of your question says, would like to merge the data by arbitrary columns, you can specify them using on. But that is not necessary here, since it's pretty clear that you are merging your data on dates, and the natural way to store dates is as the index of your dataframes.
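For completeness, merging on a column with on might look like the sketch below. It assumes the dates are kept in a regular column named Date in both frames; the frames a and b and their values are hypothetical.

```python
import pandas as pd

# Hypothetical frames that keep the date as a regular column
a = pd.DataFrame({'Date': pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03']),
                  'AAPL': [1.0, 2.0, 3.0]})
b = pd.DataFrame({'Date': pd.to_datetime(['2000-01-02', '2000-01-03', '2000-01-04']),
                  'AMZN': [20.0, 30.0, 40.0]})

# Merge on the shared 'Date' column instead of the index
merged = pd.merge(a, b, how='left', on='Date')
print(merged)
```

The result keeps the three dates of a, and AMZN is NaN for 2000-01-01 because b has no row for that date.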

Snippet:

# Imports
import pandas as pd
import numpy as np

# sample data
np.random.seed(123)
AAPL = pd.Series(np.random.randn(100),index=pd.date_range('1/1/2000', periods=100)).cumsum()
AMZN = pd.Series(np.random.randn(100),index=pd.date_range('1/1/2000', periods=100)).cumsum()
MSFT = pd.Series(np.random.randn(100),index=pd.date_range('3/1/2000', periods=100)).cumsum()
RNDM = pd.Series(np.random.randn(100),index=pd.date_range('3/1/2000', periods=100)).cumsum()

# two dataframes with a common index
df = pd.concat([AAPL, AMZN], axis = 1)
df.columns = ['AAPL', 'AMZN']
ff = pd.concat([MSFT, RNDM], axis = 1)
ff.columns = ['MSFT', 'RNDM']

# merged dataframe from two dataframes
# that do not perfectly share a common index
dfm = pd.merge(df, ff, how='left', left_index=True, right_index=True)
dfm.head()

Output:

               AAPL      AMZN  MSFT  RNDM
2000-01-01 -1.085631  0.642055   NaN   NaN
2000-01-02 -0.088285 -1.335833   NaN   NaN
2000-01-03  0.194693 -0.623569   NaN   NaN
2000-01-04 -1.311601  1.974735   NaN   NaN
2000-01-05 -1.890202  1.950109   NaN   NaN

Plot, using dfm.plot():

[image: line plot produced by dfm.plot()]

As you can see, MSFT and RNDM don't have any observations prior to March. So what should you do with all those missing values? That depends entirely on the structure of your dataset and the reason the data is missing. Take a look at "What to do with missing values when plotting with seaborn?" for some advice and a brief introduction to handling missing data in pandas dataframes.
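If, as in the question, the goal is one Close column per stock with incomplete dates dropped, one possible approach is sketched below. It uses pd.concat rather than pd.merge, and the frames dictionary is a hypothetical stand-in for dataframes that would be read from ./stocks/<symbol>.csv and indexed by date.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for frames loaded from ./stocks/<symbol>.csv
np.random.seed(0)
frames = {
    'AAPL': pd.DataFrame({'Close': np.random.randn(5).cumsum()},
                         index=pd.date_range('2000-01-01', periods=5)),
    'AMZN': pd.DataFrame({'Close': np.random.randn(5).cumsum()},
                         index=pd.date_range('2000-01-02', periods=5)),
    'MSFT': pd.DataFrame({'Close': np.random.randn(5).cumsum()},
                         index=pd.date_range('2000-01-03', periods=5)),
}

# One Close column per symbol, aligned on the union of all dates
closes = pd.concat({sym: f['Close'] for sym, f in frames.items()}, axis=1)

# Keep only the dates on which every symbol has a price
complete = closes.dropna()
print(complete)
```

The dictionary keys become the column names, so the same pattern should extend to ten symbols without naming each column by hand. Whether dropping rows is appropriate still depends on why the data is missing.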
