简体   繁体   中英

Create new dataframe using “VLOOKUP” between two dataframes

Somewhat similar to Excel's VLOOKUP function, I am wanting to use a value in one dataframe ( portfolios below) to find an associated value in a second dataframe ( returns below) and populate a third dataframe (let's call this dataframe3 for now) with these returned values. I have found several posts based on left merges and map , but my original two dataframes are of different structures, so these methods don't seem to fit (to me, at least).

I haven't made much progress, but here is the code I have:

Code

import pandas as pd

portfolios = pd.read_csv('portstst5_1.csv')
returns = pd.read_csv('Example_Returns.csv')

total_cols = len(portfolios.columns)
headers = list(portfolios)

concat = returns['PERMNO'].map(str) + returns['FROMDATE'].map(str)
idx = 2
returns.insert(loc=idx, column="concat", value=concat)

for i in range(total_cols):
    col_len = portfolios.iloc[:,i].count()
    for j in range(col_len):
        print(portfolios.iat[j,i].astype('int').astype('str') + headers[i])

Data

This code will make a little more sense if I first describe my data: portfolios is a dataframe with 13 columns of varying lengths. The column headers are dates in YYYYMMDD format. Below each date header are identifiers which are five digit numeric codes. A snippet of portfolios looks like this (some elements in some columns contain NaN):

    20131231  20131130  20131031  20130930  20130831  20130731  20130630  \
0    93044.0   93044.0   13264.0   13264.0   89169.0   82486.0   91274.0   
1    79702.0   91515.0   90710.0   81148.0   47387.0   88359.0   93353.0   
2    85751.0   85724.0   88810.0   11513.0   85576.0   47387.0   85576.0

The data in returns data originally consists of three columns and 799 rows and looks like this (all elements are populated with values):

     PERMNO  FROMDATE     MORET
0     93044  20131231 -0.022304
1     79702  20131231  0.012283
2     85751  20131231 -0.016453
3     85576  20131231  0.038766

Desired Output

I would like to make a third dataframe that is structured identically to portfolios . That is, it will have the same column header dates and the same number of rows in each column as does portfolios , but instead of identifiers, it will contain the MORET for the appropriate identifier/date combination. This is the reason for the concatenations in my code above - I am trying (perhaps unnecessarily) to create unique lookup values so I can communicate between portfolios and returns . For example, to populate dataframe3[0,0] , I would look for the concatenated values from portfolios[0,0] and headers[0] (ie 9304420131231) in returns['concat'] and return the associated value in returns['MORET'] (ie -0.022304). I am stuck here on how to use the concatenated values to return my desired data.

Any thoughts are greatly appreciated.

What you are trying to do is much simpler than how you tried doing it. You can first melt portfolios to flip it and collect all the date columns as rows in a single column, then join it with returns , and finally pivot to get the desired result. This is basically what @djk47463 did in one compound line, and my edited answer serves as a step-by-step breakdown of his.

Let's create your DataFrames to make the answer reproducible.

import pandas as pd
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

# Create df
rawText = StringIO("""
     PERMNO  FROMDATE     MORET
0     93044  20131231 -0.022304
1     79702  20131231  0.012283
2     85751  20131231 -0.016453
3     85576  20131231  0.038766
4     93044  20131010 -0.02
5     79702  20131010  0.01
6     85751  20131010 -0.01
7     85576  20131010  0.03
""")
returns = pd.read_csv(rawText, sep = "\s+")
portfolios = pd.DataFrame({'20131010':[93044, 85751],
                       '20131231':[85576, 79702]})

Notice, the FROMDATE column of returns consists of numbers, but in portfolios the date columns are strings. We must make them consistent:

df.FROMDATE = df.FROMDATE.astype(str)

Let's start the solution by melt ing (ie unpivot) portfolios :

portfolios = portfolios.melt(var_name=['FROMDATE'],value_name='PERMNO')
# portfolios: 
   FROMDATE  PERMNO
0  20131010   93044
1  20131010   85751
2  20131231   85576
3  20131231   79702

Now you want to hold this pm constant, and merge returns to its lines whenever their PERMNO s and FROMDATE s match:

merged = pm.merge(df, how='left', on=['PERMNO', 'FROMDATE'])
# merged: 
   FROMDATE  PERMNO     MORET
0  20131010   93044 -0.020000
1  20131010   85751 -0.010000
2  20131231   85576  0.038766
3  20131231   79702  0.012283

Remember we had melt ed (unpivoted) the portfolios at the beginning? We should pivot this result to give it the shape of portfolios :

final = merged.pivot(index='PERMNO', columns='FROMDATE', values='MORET').reset_index()
# final: 
FROMDATE  PERMNO  20131010  20131231
0          79702       NaN  0.012283
1          85576       NaN  0.038766
2          85751     -0.01       NaN
3          93044     -0.02       NaN

IIUC:

Using a combination of melt so the we can merge values from returns by desired columns. Then use pivot to reshape the data back, as seen below.

portfolios.columns = portfolios.columns.astype(int)
newdf = portfolios.reset_index().melt(id_vars='index',var_name=['FROMDATE'],value_name='PERMNO').merge(returns,on=['FROMDATE','PERMNO'],how='left').pivot(index='index',columns='FROMDATE',values='MORET')

Which returnsthe DataFrame below

FROMDATE  20130630  20130731  20130831  20130930  20131031  20131130  20131231
index
0              NaN       NaN       NaN       NaN       NaN       NaN -0.022304
1              NaN       NaN       NaN       NaN       NaN       NaN  0.012283
2              NaN       NaN       NaN       NaN       NaN       NaN -0.016453

Sort columns

newdf.loc[:,newdf.columns.sort_values(ascending=False)]

The typical way to do a vlookup in python is to create a series with what would be your left column in the index, and then slice that series by the lookup value. The NaNs complicate it a little. We'll make a series from returns by using the set_index method to set PERMNO as the index for the dataframe, and then slicing by the column name to isolate the MORET column as a series.

lookupseries = returns.set_index('PERMNO')['MORET']
def lookup(x):
    try: 
        return lookupseries[x]
    except: 
        return np.nan
newdf = portfolios.copy()
for c in newdf.columns:
    newdf[c] = newdf[c].apply(lookup)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM