Using a "VLOOKUP" between two dataframes to create a new dataframe

Question

Somewhat like Excel's VLOOKUP function, I want to use a value in one dataframe (portfolios below) to find an associated value in a second dataframe (returns below) and populate a third dataframe (call it dataframe3 for now) with the returned values. I have found several posts based on left merges and map, but my two original dataframes have different structures, so those methods don't seem to fit (to me, at least).

I haven't made much progress, but here is the code I have:

Code

import pandas as pd

portfolios = pd.read_csv('portstst5_1.csv')
returns = pd.read_csv('Example_Returns.csv')

total_cols = len(portfolios.columns)
headers = list(portfolios)

concat = returns['PERMNO'].map(str) + returns['FROMDATE'].map(str)
idx = 2
returns.insert(loc=idx, column="concat", value=concat)

for i in range(total_cols):
    col_len = portfolios.iloc[:,i].count()
    for j in range(col_len):
        print(portfolios.iat[j,i].astype('int').astype('str') + headers[i])

Data

This code will make a little more sense if I first describe my data: portfolios is a dataframe with 13 columns of varying lengths. The column headers are dates in YYYYMMDD format. Below each date header are identifiers, which are five-digit numeric codes. A snippet of portfolios looks like this (some elements in some columns contain NaN):

    20131231  20131130  20131031  20130930  20130831  20130731  20130630  \
0    93044.0   93044.0   13264.0   13264.0   89169.0   82486.0   91274.0   
1    79702.0   91515.0   90710.0   81148.0   47387.0   88359.0   93353.0   
2    85751.0   85724.0   88810.0   11513.0   85576.0   47387.0   85576.0

The returns data originally consists of three columns and 799 rows and looks like this (all elements are populated with values):

     PERMNO  FROMDATE     MORET
0     93044  20131231 -0.022304
1     79702  20131231  0.012283
2     85751  20131231 -0.016453
3     85576  20131231  0.038766

Desired Output

I would like to make a third dataframe that is structured identically to portfolios. That is, it will have the same column header dates and the same number of rows in each column as portfolios, but instead of identifiers, it will contain the MORET for the appropriate identifier/date combination. This is the reason for the concatenations in my code above: I am trying (perhaps unnecessarily) to create unique lookup values so I can communicate between portfolios and returns. For example, to populate dataframe3[0,0], I would look for the concatenated values from portfolios[0,0] and headers[0] (i.e. 9304420131231) in returns['concat'] and return the associated value in returns['MORET'] (i.e. -0.022304). I am stuck on how to use the concatenated values to return my desired data.

Any thoughts are greatly appreciated.

Answer 1

What you are trying to do is much simpler than the way you attempted it. You can first melt portfolios to unpivot it, collecting all the date columns as rows in a single column, then join it with returns, and finally pivot to get the desired result. This is basically what @djk47463 did in one compound line, and my edited answer serves as a step-by-step breakdown of his.

Let's create your DataFrames to make the answer reproducible.

import pandas as pd
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

# Create df
rawText = StringIO("""
     PERMNO  FROMDATE     MORET
0     93044  20131231 -0.022304
1     79702  20131231  0.012283
2     85751  20131231 -0.016453
3     85576  20131231  0.038766
4     93044  20131010 -0.02
5     79702  20131010  0.01
6     85751  20131010 -0.01
7     85576  20131010  0.03
""")
# Raw string so that \s is passed to the regex engine unchanged
returns = pd.read_csv(rawText, sep=r"\s+")
portfolios = pd.DataFrame({'20131010': [93044, 85751],
                           '20131231': [85576, 79702]})

Notice that the FROMDATE column of returns consists of numbers, but in portfolios the date column headers are strings. We must make them consistent:

returns['FROMDATE'] = returns['FROMDATE'].astype(str)
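A quick way to confirm the mismatch before merging is to inspect the types on both sides. The following is a minimal sketch on a two-row sample of the data above; the variable names headers_are_str, fromdate_is_int, and matches are introduced here for illustration only.

```python
import pandas as pd

returns = pd.DataFrame({
    "PERMNO": [93044, 79702],
    "FROMDATE": [20131231, 20131231],
    "MORET": [-0.022304, 0.012283],
})
portfolios = pd.DataFrame({"20131231": [93044, 79702]})

# Column labels supplied as strings stay strings; FROMDATE built from numbers is integer.
headers_are_str = all(isinstance(c, str) for c in portfolios.columns)
fromdate_is_int = pd.api.types.is_integer_dtype(returns["FROMDATE"])

# Cast FROMDATE to str so its values compare equal to the column labels.
returns["FROMDATE"] = returns["FROMDATE"].astype(str)
matches = returns["FROMDATE"].isin(portfolios.columns).all()
```

Converting the column headers to integers instead would work just as well; what matters is that both key columns end up with the same type.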

Let's start the solution by melting (i.e. unpivoting) portfolios:

pm = portfolios.melt(var_name='FROMDATE', value_name='PERMNO')
# pm:
   FROMDATE  PERMNO
0  20131010   93044
1  20131010   85751
2  20131231   85576
3  20131231   79702

Now hold pm fixed and merge returns onto its rows wherever PERMNO and FROMDATE both match:

merged = pm.merge(returns, how='left', on=['PERMNO', 'FROMDATE'])
# merged:
   FROMDATE  PERMNO     MORET
0  20131010   93044 -0.020000
1  20131010   85751 -0.010000
2  20131231   85576  0.038766
3  20131231   79702  0.012283
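With a left merge, identifier/date pairs that have no entry in returns silently become NaN. As a hedged sketch, the indicator and validate options of merge can make such cases visible; the identifier 11111 below is made up for the demonstration and the names checked and unmatched are illustrative.

```python
import pandas as pd

pm = pd.DataFrame({
    "FROMDATE": ["20131010", "20131010", "20131231"],
    "PERMNO": [93044, 85751, 11111],  # 11111 is a made-up identifier with no return
})
returns = pd.DataFrame({
    "PERMNO": [93044, 85751, 85576],
    "FROMDATE": ["20131010", "20131010", "20131231"],
    "MORET": [-0.02, -0.01, 0.038766],
})

# indicator=True adds a "_merge" column saying where each row came from;
# validate="many_to_one" raises if a PERMNO/FROMDATE pair is duplicated in returns.
checked = pm.merge(returns, how="left", on=["PERMNO", "FROMDATE"],
                   indicator=True, validate="many_to_one")
unmatched = checked[checked["_merge"] == "left_only"]
```

Here unmatched holds the rows of pm for which no return was found, which is usually worth checking before pivoting.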

Remember that we melted (unpivoted) portfolios at the beginning? We should now pivot this result back into a wide layout with one column per date, as in portfolios. Note that the rows here are keyed by PERMNO rather than by the original row position:

final = merged.pivot(index='PERMNO', columns='FROMDATE', values='MORET').reset_index()
# final:
FROMDATE  PERMNO  20131010  20131231
0          79702       NaN  0.012283
1          85576       NaN  0.038766
2          85751     -0.01       NaN
3          93044     -0.02       NaN
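If the result should mirror portfolios row for row instead of being keyed by PERMNO, one option is to carry the original row position through the melt and pivot on it, which is what Answer 2 does in a single chain. The sketch below assumes the same sample data as this answer; the id column name "row" is chosen here for illustration.

```python
import pandas as pd

returns = pd.DataFrame({
    "PERMNO": [93044, 79702, 85751, 85576, 93044, 79702, 85751, 85576],
    "FROMDATE": ["20131231"] * 4 + ["20131010"] * 4,
    "MORET": [-0.022304, 0.012283, -0.016453, 0.038766, -0.02, 0.01, -0.01, 0.03],
})
portfolios = pd.DataFrame({"20131010": [93044, 85751],
                           "20131231": [85576, 79702]})

# Keep the original row position as an id column so it survives the melt.
pm = (portfolios.rename_axis("row").reset_index()
      .melt(id_vars="row", var_name="FROMDATE", value_name="PERMNO"))
merged = pm.merge(returns, how="left", on=["PERMNO", "FROMDATE"])

# Pivot on the row position: one row per original row, one column per date.
final = merged.pivot(index="row", columns="FROMDATE", values="MORET")
```

Each cell of final then holds the MORET for the identifier that sat in the same cell of portfolios.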

Answer 2

IIUC:

Use melt so that we can merge values from returns on the desired columns, then use pivot to reshape the data back, as seen below.

portfolios.columns = portfolios.columns.astype(int)
newdf = (
    portfolios.reset_index()
    .melt(id_vars='index', var_name='FROMDATE', value_name='PERMNO')
    .merge(returns, on=['FROMDATE', 'PERMNO'], how='left')
    .pivot(index='index', columns='FROMDATE', values='MORET')
)

Which returns the DataFrame below:

FROMDATE  20130630  20130731  20130831  20130930  20131031  20131130  20131231
index
0              NaN       NaN       NaN       NaN       NaN       NaN -0.022304
1              NaN       NaN       NaN       NaN       NaN       NaN  0.012283
2              NaN       NaN       NaN       NaN       NaN       NaN -0.016453

Sort columns

newdf.loc[:,newdf.columns.sort_values(ascending=False)]

Answer 3

The typical way to do a vlookup in Python is to create a Series whose index holds what would be your leftmost (lookup) column, and then index that Series by the lookup value. The NaNs complicate it a little. Because a return is identified by both the identifier and the date, we'll make the Series from returns by using the set_index method to set PERMNO and FROMDATE as the index of the dataframe, and then selecting by column name to isolate the MORET column as a Series.

import numpy as np

# MORET indexed by (PERMNO, FROMDATE), so each lookup is specific to a date
lookupseries = returns.set_index(['PERMNO', 'FROMDATE'])['MORET']

def lookup(permno, date):
    # NaN cells and unmatched identifier/date pairs both give NaN
    if pd.isna(permno):
        return np.nan
    return lookupseries.get((int(permno), date), np.nan)

newdf = portfolios.copy()
for c in newdf.columns:
    # Column headers are dates; cast to int to match the integer FROMDATE values
    newdf[c] = newdf[c].apply(lookup, date=int(c))
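A variation on the same lookup idea avoids calling a Python function for every cell by mapping each date column against a per-date Series. This is only a sketch under stated assumptions: each PERMNO appears at most once per FROMDATE in returns, the column headers of portfolios are date strings, and the return 0.05 for 20131130 is made up for the demonstration. The names by_date and result are illustrative.

```python
import numpy as np
import pandas as pd

returns = pd.DataFrame({
    "PERMNO": [93044, 79702, 93044],
    "FROMDATE": [20131231, 20131231, 20131130],
    "MORET": [-0.022304, 0.012283, 0.05],  # 0.05 is a made-up value
})
portfolios = pd.DataFrame({"20131231": [93044.0, 79702.0],
                           "20131130": [93044.0, np.nan]})

# One lookup Series per date, indexed by PERMNO cast to float so that it
# lines up with the float identifiers in portfolios (float because of NaN).
by_date = {
    date: grp.set_index(grp["PERMNO"].astype(float))["MORET"]
    for date, grp in returns.groupby("FROMDATE")
}

result = pd.DataFrame(index=portfolios.index)
for col in portfolios.columns:
    per_date = by_date.get(int(col))
    if per_date is None:
        result[col] = np.nan  # no returns at all for this date
    else:
        # map looks each identifier up in the Series index; misses and NaN give NaN
        result[col] = portfolios[col].map(per_date)
```

The result keeps the shape and column order of portfolios, with NaN wherever the original cell was empty or had no matching return.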

3 answers

Answer 1
1 2017-12-28 22:50:51

Answer 2
1 accepted 2017-12-29 01:54:31

Answer 3
0 2017-12-28 21:35:26
