
Create new dataframe using "VLOOKUP" between two dataframes

Somewhat similar to Excel's VLOOKUP function, I want to use a value in one dataframe (portfolios below) to find an associated value in a second dataframe (returns below) and populate a third dataframe (let's call it dataframe3 for now) with the returned values. I have found several posts based on left merges and map, but my two original dataframes have different structures, so these methods don't seem to fit (to me, at least).

I haven't made much progress, but here is the code I have:

Code

import pandas as pd

portfolios = pd.read_csv('portstst5_1.csv')
returns = pd.read_csv('Example_Returns.csv')

total_cols = len(portfolios.columns)
headers = list(portfolios)

concat = returns['PERMNO'].map(str) + returns['FROMDATE'].map(str)
idx = 2
returns.insert(loc=idx, column="concat", value=concat)

for i in range(total_cols):
    col_len = portfolios.iloc[:,i].count()
    for j in range(col_len):
        print(portfolios.iat[j,i].astype('int').astype('str') + headers[i])

Data

This code will make a little more sense if I first describe my data: portfolios is a dataframe with 13 columns of varying lengths. The column headers are dates in YYYYMMDD format. Below each date header are identifiers, which are five-digit numeric codes. A snippet of portfolios looks like this (some elements in some columns contain NaN):

    20131231  20131130  20131031  20130930  20130831  20130731  20130630  \
0    93044.0   93044.0   13264.0   13264.0   89169.0   82486.0   91274.0   
1    79702.0   91515.0   90710.0   81148.0   47387.0   88359.0   93353.0   
2    85751.0   85724.0   88810.0   11513.0   85576.0   47387.0   85576.0

The data in returns originally consists of three columns and 799 rows and looks like this (all elements are populated with values):

     PERMNO  FROMDATE     MORET
0     93044  20131231 -0.022304
1     79702  20131231  0.012283
2     85751  20131231 -0.016453
3     85576  20131231  0.038766

Desired Output

I would like to make a third dataframe that is structured identically to portfolios. That is, it will have the same column header dates and the same number of rows in each column as portfolios, but instead of identifiers it will contain the MORET for the appropriate identifier/date combination. This is the reason for the concatenations in my code above: I am trying (perhaps unnecessarily) to create unique lookup values so I can communicate between portfolios and returns. For example, to populate dataframe3[0,0], I would look for the concatenation of portfolios[0,0] and headers[0] (i.e. 9304420131231) in returns['concat'] and return the associated value in returns['MORET'] (i.e. -0.022304). I am stuck on how to use the concatenated values to return my desired data.

Any thoughts are greatly appreciated.

What you are trying to do is much simpler than the way you tried to do it. You can first melt portfolios to unpivot it, collecting all the date columns as rows in a single column, then join it with returns, and finally pivot to get the desired result. This is basically what @djk47463 did in one compound line, and my edited answer serves as a step-by-step breakdown of his.

Let's create your DataFrames to make the answer reproducible.

import pandas as pd
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

# Create df
rawText = StringIO("""
     PERMNO  FROMDATE     MORET
0     93044  20131231 -0.022304
1     79702  20131231  0.012283
2     85751  20131231 -0.016453
3     85576  20131231  0.038766
4     93044  20131010 -0.02
5     79702  20131010  0.01
6     85751  20131010 -0.01
7     85576  20131010  0.03
""")
returns = pd.read_csv(rawText, sep=r"\s+")  # raw string avoids an invalid-escape warning
portfolios = pd.DataFrame({'20131010':[93044, 85751],
                       '20131231':[85576, 79702]})

Notice that the FROMDATE column of returns consists of numbers, but in portfolios the date column headers are strings. We must make them consistent:

returns['FROMDATE'] = returns['FROMDATE'].astype(str)

Let's start the solution by melting (i.e. unpivoting) portfolios into a new frame, pm:

pm = portfolios.melt(var_name='FROMDATE', value_name='PERMNO')
# pm: 
   FROMDATE  PERMNO
0  20131010   93044
1  20131010   85751
2  20131231   85576
3  20131231   79702

Now hold pm fixed and merge returns onto its rows wherever both PERMNO and FROMDATE match:

merged = pm.merge(returns, how='left', on=['PERMNO', 'FROMDATE'])
# merged: 
   FROMDATE  PERMNO     MORET
0  20131010   93044 -0.020000
1  20131010   85751 -0.010000
2  20131231   85576  0.038766
3  20131231   79702  0.012283
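
If some rows come back with NaN in MORET, it helps to see which identifier/date pairs had no match in returns. The following is a minimal sketch, assuming made-up toy data (the identifier 11111 is invented for illustration); it uses the indicator and validate options of merge to flag unmatched rows and to guard against duplicate keys in returns:

```python
import pandas as pd

# Hypothetical toy data; 11111 is an invented identifier with no return record.
pm = pd.DataFrame({"FROMDATE": ["20131010", "20131010", "20131231"],
                   "PERMNO": [93044, 85751, 11111]})
returns = pd.DataFrame({"PERMNO": [93044, 85751, 85576],
                        "FROMDATE": ["20131010", "20131010", "20131231"],
                        "MORET": [-0.02, -0.01, 0.038766]})

# validate raises if returns has duplicate (PERMNO, FROMDATE) keys;
# indicator adds a "_merge" column saying where each row was found.
checked = pm.merge(returns, how="left", on=["PERMNO", "FROMDATE"],
                   validate="many_to_one", indicator=True)

# Rows of pm that found no partner in returns.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["FROMDATE", "PERMNO"]])
```

Duplicate keys in returns matter here because the later pivot step fails when the same index/column pair appears more than once.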

Remember that we melted (unpivoted) portfolios at the beginning? We now pivot the result back to one column per date. Note that this version is indexed by PERMNO rather than by the original row position in portfolios:

final = merged.pivot(index='PERMNO', columns='FROMDATE', values='MORET').reset_index()
# final: 
FROMDATE  PERMNO  20131010  20131231
0          79702       NaN  0.012283
1          85576       NaN  0.038766
2          85751     -0.01       NaN
3          93044     -0.02       NaN

IIUC:

Use melt so that we can merge the values from returns on the desired columns, then use pivot to reshape the data back, as seen below.

portfolios.columns = portfolios.columns.astype(int)
newdf = (portfolios.reset_index()
         .melt(id_vars='index', var_name='FROMDATE', value_name='PERMNO')
         .merge(returns, on=['FROMDATE', 'PERMNO'], how='left')
         .pivot(index='index', columns='FROMDATE', values='MORET'))

This returns the DataFrame below:

FROMDATE  20130630  20130731  20130831  20130930  20131031  20131130  20131231
index
0              NaN       NaN       NaN       NaN       NaN       NaN -0.022304
1              NaN       NaN       NaN       NaN       NaN       NaN  0.012283
2              NaN       NaN       NaN       NaN       NaN       NaN -0.016453

Sort columns

newdf.loc[:,newdf.columns.sort_values(ascending=False)]
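
To check the melt/merge/pivot chain end to end, here is a self-contained sketch on toy data. The frames below are assumptions shaped like the ones in the question (the 20131130 returns are invented), and the dropna, astype and reindex steps are additions of mine: they keep the merge keys as integers and restore the original row and column layout of portfolios.

```python
import pandas as pd

# Hypothetical toy data shaped like the question's frames.
portfolios = pd.DataFrame({
    "20131231": [93044.0, 79702.0, 85751.0],
    "20131130": [93044.0, 85576.0, float("nan")],
})
returns = pd.DataFrame({
    "PERMNO": [93044, 79702, 85751, 93044, 85576],
    "FROMDATE": [20131231, 20131231, 20131231, 20131130, 20131130],
    "MORET": [-0.022304, 0.012283, -0.016453, 0.01, 0.03],
})

# Make the column headers integers so they match returns["FROMDATE"].
portfolios.columns = portfolios.columns.astype(int)

# Long form: one row per (row position, date, identifier); empty cells dropped.
long_form = (
    portfolios.reset_index()
    .melt(id_vars="index", var_name="FROMDATE", value_name="PERMNO")
    .dropna(subset=["PERMNO"])
    .astype({"PERMNO": "int64"})
)

# Look up MORET per (date, identifier), then reshape back to the original grid.
newdf = (
    long_form.merge(returns, on=["FROMDATE", "PERMNO"], how="left")
    .pivot(index="index", columns="FROMDATE", values="MORET")
    .reindex(index=portfolios.index, columns=portfolios.columns)
)
print(newdf)
```

The final reindex puts the columns back in the order of portfolios and keeps any row that was entirely NaN, so newdf has the same shape as portfolios, with NaN where portfolios had no identifier.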

The typical way to do a vlookup in Python is to create a Series with what would be your left column as the index, and then slice that Series by the lookup value. The NaNs complicate it a little. We'll make a Series from returns by using the set_index method to set PERMNO as the index of the dataframe, and then slicing by column name to isolate the MORET column as a Series.

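import numpy as np

# Series of MORET values indexed by PERMNO
lookupseries = returns.set_index('PERMNO')['MORET']
def lookup(x):
    # Return the MORET for identifier x, or NaN if x is missing (e.g. a NaN cell)
    try: 
        return lookupseries[x]
    except (KeyError, TypeError): 
        return np.nan
newdf = portfolios.copy()
for c in newdf.columns:
    newdf[c] = newdf[c].apply(lookup)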
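
The lookup above is keyed on PERMNO alone, so it only gives one value per cell when each PERMNO appears once in returns. A minimal sketch of a date-aware variant follows, assuming toy data (the 20131130 returns are invented) and a plain dictionary keyed on (PERMNO, FROMDATE) pairs; this is my adaptation, not part of the original answer:

```python
import pandas as pd

# Hypothetical toy data shaped like the question's frames.
portfolios = pd.DataFrame({
    "20131231": [93044.0, 79702.0, 85751.0],
    "20131130": [93044.0, 85576.0, float("nan")],
})
returns = pd.DataFrame({
    "PERMNO": [93044, 79702, 85751, 93044, 85576],
    "FROMDATE": [20131231, 20131231, 20131231, 20131130, 20131130],
    "MORET": [-0.022304, 0.012283, -0.016453, 0.01, 0.03],
})

# Dictionary mapping (PERMNO, FROMDATE) -> MORET.
lookup = returns.set_index(["PERMNO", "FROMDATE"])["MORET"].to_dict()

# For each date column, look up every identifier together with that date;
# missing pairs (including NaN cells) fall back to NaN.
newdf = pd.DataFrame(
    {
        date: [lookup.get((permno, int(date)), float("nan"))
               for permno in portfolios[date]]
        for date in portfolios.columns
    },
    index=portfolios.index,
)
print(newdf)
```

Because the column header is part of the key, the same PERMNO can carry a different MORET in each date column.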


