
Python Pandas - How to match data from one dataframe to another

I have two dataframes related to stocks and their prices, and I'm trying to cross-match data between them.

df1 = database of users who have each chosen a number of stocks:

  Username Stock 1 Stock 2
0   JB3004    TSLA    MSFT
1   JM3009    SHOP    SPOT
2   DB0208    TWTR    MSFT
3   AB3011    TWTR    PTON
4   CB3004    MSFT    TSLA

df2 = Today's close price for each of the stocks:

               TWTR      SPOT      PTON      SHOP      MSFT      TSLA
Date           Adj Close Adj Close Adj Close Adj Close Adj Close Adj Close
2020-12-11     51.44     341.22     117.1   1057.87    213.26    609.99

I'm trying to match the relevant stocks for each user in df1 to the Adj Close price in df2 so that I can print a df3 with the correct closing price for the stocks each user has chosen.

How would I do this? Nothing I've tried comes close, so I need some help!

I faced a similar problem and found a solution, which I am sharing here in the hope that it helps you get your answer.

Create df1

import pandas as pd

data1 = {"Username": ["JB3004", "JM3009", "DB0208", "AB3011", "CB3004"],
         "Stock_1": ["TSLA", "SHOP", "TWTR", "TWTR", "MSFT"],
         "Stock_2": ["MSFT", "SPOT", "MSFT", "PTON", "TSLA"]}

df1 = pd.DataFrame(data=data1)
df1.head()
   Username Stock_1 Stock_2
0   JB3004  TSLA    MSFT
1   JM3009  SHOP    SPOT
2   DB0208  TWTR    MSFT
3   AB3011  TWTR    PTON
4   CB3004  MSFT    TSLA

Convert wide format to long format data

df1_1 = pd.wide_to_long(df1, stubnames='Stock_', i='Username', j='Stock_num')
df1_1.reset_index(inplace=True)
df1_1
   Username Stock_num   Stock_
0   JB3004     1        TSLA
1   JM3009     1        SHOP
2   DB0208     1        TWTR
3   AB3011     1        TWTR
4   CB3004     1        MSFT
5   JB3004     2        MSFT
6   JM3009     2        SPOT
7   DB0208     2        MSFT
8   AB3011     2        PTON
9   CB3004     2        TSLA

Rename the column Stock_ to Stocks

df1_1.rename(columns={"Stock_": "Stocks"}, inplace=True)
df1_1
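
As a possible shortcut, wide_to_long can split the column names on the underscore itself, so the value column comes out as Stock and no rename is needed. This is only a sketch, assuming the Stock_1/Stock_2 column names used above; the name long_df and the two-user sample are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Username": ["JB3004", "JM3009"],
    "Stock_1": ["TSLA", "SHOP"],
    "Stock_2": ["MSFT", "SPOT"],
})

# stubnames="Stock" with sep="_" matches Stock_1, Stock_2, ...;
# the numeric suffix becomes the Stock_num level of the index.
long_df = pd.wide_to_long(
    df1, stubnames="Stock", i="Username", j="Stock_num", sep="_"
).reset_index()

print(long_df)
```

With this variant the later merge key would be Stock rather than Stocks, so the column name on the price side has to match.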

Create df2 to match your df2

The closing_price.csv file contains the closing price data (the first line below is only a label, not part of the file):

# closing_price.csv
,TWTR,SPOT,PTON,SHOP,MSFT,TSLA
Date,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close
2020-12-11,51.44,341.22,117.1,1057.87,213.26,609.99

Load df2

df2 = pd.read_csv("closing_price.csv", index_col=None)
df2.head()
Unnamed: 0       TWTR    SPOT         PTON        SHOP        MSFT        TSLA
0      Date   Adj Close Adj Close   Adj Close   Adj Close   Adj Close   Adj Close
1   2020-12-11  51.44    341.22       117.1      1057.87      213.26      609.99
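
Because the "Date, Adj Close, ..." label row is read as data here, every price column holds strings rather than numbers. One way to avoid that, sketched under the assumption that the file has exactly the three lines shown above, is to skip the label row while loading. The CSV text is inlined through io.StringIO only to keep the example self-contained, and the name prices is illustrative.

```python
import io
import pandas as pd

csv_text = """,TWTR,SPOT,PTON,SHOP,MSFT,TSLA
Date,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close,Adj Close
2020-12-11,51.44,341.22,117.1,1057.87,213.26,609.99
"""

# skiprows=[1] drops the "Date,Adj Close,..." label row (0-based line number);
# index_col=0 uses the date column as the index.
prices = pd.read_csv(io.StringIO(csv_text), index_col=0, skiprows=[1])
prices.index.name = "Date"

print(prices.dtypes)  # the price columns are now parsed as floats
```

A frame loaded this way has the dates in the index, so it would need reset_index() before a melt on a Date column.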

Data cleaning and transformation

# Name the first column "Date" and drop the leftover "Adj Close" label row
df2 = df2.rename(columns={"Unnamed: 0": "Date"}).drop(index=0)
df2.head()
       Date     TWTR    SPOT    PTON    SHOP    MSFT    TSLA
1   2020-12-11  51.44   341.22  117.1   1057.87 213.26  609.99

Convert wide format to long format data

# Convert wide format to long format data
df2_1 = pd.melt(df2, id_vars=['Date'], value_vars=["TWTR", "SPOT", "PTON", "SHOP", "MSFT", "TSLA"], var_name="Stocks", value_name="Adj Close")
df2_1
       Date    Stocks   Adj Close
0   2020-12-11  TWTR    51.44
1   2020-12-11  SPOT    341.22
2   2020-12-11  PTON    117.1
3   2020-12-11  SHOP    1057.87
4   2020-12-11  MSFT    213.26
5   2020-12-11  TSLA    609.99

Now, df1_1 and df2_1 are as below:

df1_1
   Username Stock_num   Stocks
0   JB3004     1         TSLA
1   JM3009     1         SHOP
2   DB0208     1         TWTR
3   AB3011     1         TWTR
4   CB3004     1         MSFT
5   JB3004     2         MSFT
6   JM3009     2         SPOT
7   DB0208     2         MSFT
8   AB3011     2         PTON
9   CB3004     2         TSLA
df2_1
      Date     Stocks   Adj Close
0   2020-12-11  TWTR    51.44
1   2020-12-11  SPOT    341.22
2   2020-12-11  PTON    117.1
3   2020-12-11  SHOP    1057.87
4   2020-12-11  MSFT    213.26
5   2020-12-11  TSLA    609.99

Merge df1_1 and df2_1 on column "Stocks"

# Merge df1_1 and df2_1 on column "Stocks"
df3 = pd.merge(df1_1, df2_1, on='Stocks')
df3
   Username Stock_num   Stocks  Date       Adj Close
0   JB3004     1        TSLA    2020-12-11  609.99
1   CB3004     2        TSLA    2020-12-11  609.99
2   JM3009     1        SHOP    2020-12-11  1057.87
3   DB0208     1        TWTR    2020-12-11  51.44
4   AB3011     1        TWTR    2020-12-11  51.44
5   CB3004     1        MSFT    2020-12-11  213.26
6   JB3004     2        MSFT    2020-12-11  213.26
7   DB0208     2        MSFT    2020-12-11  213.26
8   JM3009     2        SPOT    2020-12-11  341.22
9   AB3011     2        PTON    2020-12-11  117.1
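
Note that pd.merge performs an inner join by default, so a holding whose ticker has no price would silently disappear from df3. A small sketch of how that could be checked, using made-up frames (holdings, prices, and the ticker "XXXX" are hypothetical):

```python
import pandas as pd

holdings = pd.DataFrame({
    "Username": ["JB3004", "JM3009"],
    "Stocks": ["TSLA", "XXXX"],   # "XXXX" is a made-up ticker with no price
})
prices = pd.DataFrame({
    "Stocks": ["TSLA", "MSFT"],
    "Adj Close": [609.99, 213.26],
})

# how="left" keeps every holding; indicator=True adds a _merge column;
# validate checks that each ticker appears at most once in prices.
checked = pd.merge(holdings, prices, on="Stocks", how="left",
                   indicator=True, validate="many_to_one")

# Rows found only on the left side are holdings without a price.
missing = checked[checked["_merge"] == "left_only"]
print(missing)
```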

Rearrange columns

# Rearrange columns
df3.set_index(["Date"], inplace=True)
df3.reset_index(inplace=True)
df3
       Date    Username   Stock_num Stocks  Adj Close
0   2020-12-11  JB3004       1      TSLA    609.99
1   2020-12-11  CB3004       2      TSLA    609.99
2   2020-12-11  JM3009       1      SHOP    1057.87
3   2020-12-11  DB0208       1      TWTR    51.44
4   2020-12-11  AB3011       1      TWTR    51.44
5   2020-12-11  CB3004       1      MSFT    213.26
6   2020-12-11  JB3004       2      MSFT    213.26
7   2020-12-11  DB0208       2      MSFT    213.26
8   2020-12-11  JM3009       2      SPOT    341.22
9   2020-12-11  AB3011       2      PTON    117.1

Reshape (pivot) the data based on column values

# Reshaping or pivoting data based on column values
df = df3.pivot(index="Username", columns="Stock_num", values=["Stocks", "Adj Close"])
df
           Stocks        Adj Close
Stock_num 1      2       1       2
Username                
AB3011  TWTR    PTON    51.44   117.1
CB3004  MSFT    TSLA    213.26  609.99
DB0208  TWTR    MSFT    51.44   213.26
JB3004  TSLA    MSFT    609.99  213.26
JM3009  SHOP    SPOT    1057.87 341.22
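
If the long format is not needed for anything else, a shorter route may be to look each ticker up directly with Series.map. This is a sketch of an alternative, not the approach used above; it assumes the prices are available as a Series keyed by ticker, and the names prices and the *_price columns are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Username": ["JB3004", "JM3009", "DB0208", "AB3011", "CB3004"],
    "Stock_1": ["TSLA", "SHOP", "TWTR", "TWTR", "MSFT"],
    "Stock_2": ["MSFT", "SPOT", "MSFT", "PTON", "TSLA"],
})

# Assumed lookup table: ticker -> adjusted close for the day.
prices = pd.Series({
    "TWTR": 51.44, "SPOT": 341.22, "PTON": 117.1,
    "SHOP": 1057.87, "MSFT": 213.26, "TSLA": 609.99,
})

df3 = df1.copy()
for col in ["Stock_1", "Stock_2"]:
    # map() replaces each ticker with its price; unknown tickers become NaN.
    df3[col + "_price"] = df3[col].map(prices)

print(df3)
```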

Just saw this and I thought I'd give it a whirl.

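Use pandas.DataFrame.stack() on df2 to align everything with df1, and rename some fields if you want. Here df2 is assumed to have the first CSV column as its index (for example, loaded with index_col=0), which is why the "Date" label row still has to be filtered out afterwards.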

df2t = df2.stack().reset_index().rename(
        columns={
                "level_0":"date",
                "level_1":"stock",
                0:"closing_price",
                },
        )

df2t = df2t.loc[df2t["date"] != "Date", :]

Data -

          date stock closing_price
6   2020-12-11  TWTR         51.44
7   2020-12-11  SPOT        341.22
8   2020-12-11  PTON         117.1
9   2020-12-11  SHOP       1057.87
10  2020-12-11  MSFT        213.26
11  2020-12-11  TSLA        609.99
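
If df2 already holds only numeric prices indexed by date, the level names can be set before resetting the index, which avoids the level_0/level_1 rename. A minimal sketch, assuming such a clean df2 (the three-ticker frame below is a shortened stand-in):

```python
import pandas as pd

# Assumed: one row of numeric prices indexed by date, one column per ticker.
df2 = pd.DataFrame(
    {"TWTR": [51.44], "MSFT": [213.26], "TSLA": [609.99]},
    index=["2020-12-11"],
)

df2t = (
    df2.stack()                          # Series with a (date, ticker) MultiIndex
       .rename_axis(["date", "stock"])   # name the two index levels
       .reset_index(name="closing_price")
)
print(df2t)
```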

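Use pandas.melt() on df1. The user column is named username in this answer; if yours is Username, as in the question, adjust the column name in the code below.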

df1m = pd.melt(df1, id_vars=["username"], value_vars=["Stock 1", "Stock 2"])

Data -

  username variable value
0   JB3004  Stock 1  TSLA
1   JM3009  Stock 1  SHOP
2   DB0208  Stock 1  TWTR
3   AB3011  Stock 1  TWTR
4   CB3004  Stock 1  MSFT
5   JB3004  Stock 2  MSFT
6   JM3009  Stock 2  SPOT
7   DB0208  Stock 2  MSFT
8   AB3011  Stock 2  PTON
9   CB3004  Stock 2  TSLA

Merge the dataframes.

df = pd.merge(df1m, df2t, left_on="value", right_on="stock", sort=False)

Data -

  username variable value        date stock closing_price
0   JB3004  Stock 1  TSLA  2020-12-11  TSLA        609.99
1   CB3004  Stock 2  TSLA  2020-12-11  TSLA        609.99
2   JM3009  Stock 1  SHOP  2020-12-11  SHOP       1057.87
3   DB0208  Stock 1  TWTR  2020-12-11  TWTR         51.44
4   AB3011  Stock 1  TWTR  2020-12-11  TWTR         51.44
5   CB3004  Stock 1  MSFT  2020-12-11  MSFT        213.26
6   JB3004  Stock 2  MSFT  2020-12-11  MSFT        213.26
7   DB0208  Stock 2  MSFT  2020-12-11  MSFT        213.26
8   JM3009  Stock 2  SPOT  2020-12-11  SPOT        341.22
9   AB3011  Stock 2  PTON  2020-12-11  PTON         117.1

Do some cleanup and then pivot for usable results

df = df.drop("value", axis=1).rename(columns={"variable": "holding_id"})
df = df.pivot(index="username", columns="holding_id", values=["stock", "closing_price"]).rename(columns=lambda x: x.strip())

Data -

             stock         closing_price        
holding_id Stock 1 Stock 2       Stock 1 Stock 2
username                                        
AB3011        TWTR    PTON         51.44   117.1
CB3004        MSFT    TSLA        213.26  609.99
DB0208        TWTR    MSFT         51.44  213.26
JB3004        TSLA    MSFT        609.99  213.26
JM3009        SHOP    SPOT       1057.87  341.22

Selecting data is pretty simple with a MultiIndex:

df.loc[:,"stock"]["Stock 1"]

Data -

username
AB3011    TWTR
CB3004    MSFT
DB0208    TWTR
JB3004    TSLA
JM3009    SHOP
Name: Stock 1, dtype: object

Or, include username for targeted selections:

df.loc["AB3011","stock"]["Stock 1"]

Data -

'TWTR'
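
The chained lookups above work for reading values. An alternative that is often preferred is to pass the full column key as a tuple in a single .loc call. The sketch below rebuilds a two-user version of the pivoted frame so that it runs on its own; the names long_df and wide are illustrative.

```python
import pandas as pd

long_df = pd.DataFrame({
    "username": ["AB3011", "AB3011", "CB3004", "CB3004"],
    "holding_id": ["Stock 1", "Stock 2", "Stock 1", "Stock 2"],
    "stock": ["TWTR", "PTON", "MSFT", "TSLA"],
    "closing_price": [51.44, 117.1, 213.26, 609.99],
})
wide = long_df.pivot(index="username", columns="holding_id",
                     values=["stock", "closing_price"])

# A (top level, second level) tuple selects one column of the MultiIndex.
first_stock = wide.loc[:, ("stock", "Stock 1")]

# Row label plus column tuple returns a single value.
one_price = wide.loc["AB3011", ("closing_price", "Stock 2")]

print(first_stock)
print(one_price)
```

A single .loc call also keeps assignment unambiguous if the values ever need to be updated in place.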
