Create new dataframe using “VLOOKUP” between two dataframes
Somewhat similar to Excel's VLOOKUP function, I want to use a value in one dataframe (portfolios below) to find an associated value in a second dataframe (returns below) and populate a third dataframe (let's call it dataframe3 for now) with the returned values. I have found several posts based on left merges and map, but my original two dataframes have different structures, so these methods don't seem to fit (to me, at least).
I haven't made much progress, but here is the code I have:
Code
import pandas as pd

portfolios = pd.read_csv('portstst5_1.csv')
returns = pd.read_csv('Example_Returns.csv')
total_cols = len(portfolios.columns)
headers = list(portfolios)
# Build a lookup key from identifier + date and insert it as the third column
concat = returns['PERMNO'].map(str) + returns['FROMDATE'].map(str)
idx = 2
returns.insert(loc=idx, column="concat", value=concat)
for i in range(total_cols):
    col_len = portfolios.iloc[:, i].count()  # number of non-NaN entries
    for j in range(col_len):
        print(portfolios.iat[j, i].astype('int').astype('str') + headers[i])
Data
This code will make a little more sense if I first describe my data. portfolios is a dataframe with 13 columns of varying lengths. The column headers are dates in YYYYMMDD format. Below each date header are identifiers, which are five-digit numeric codes. A snippet of portfolios looks like this (some elements in some columns contain NaN):
20131231 20131130 20131031 20130930 20130831 20130731 20130630 \
0 93044.0 93044.0 13264.0 13264.0 89169.0 82486.0 91274.0
1 79702.0 91515.0 90710.0 81148.0 47387.0 88359.0 93353.0
2 85751.0 85724.0 88810.0 11513.0 85576.0 47387.0 85576.0
The data in returns originally consists of three columns and 799 rows and looks like this (all elements are populated with values):
PERMNO FROMDATE MORET
0 93044 20131231 -0.022304
1 79702 20131231 0.012283
2 85751 20131231 -0.016453
3 85576 20131231 0.038766
Desired Output
I would like to make a third dataframe that is structured identically to portfolios. That is, it will have the same column header dates and the same number of rows in each column as portfolios, but instead of identifiers it will contain the MORET for the appropriate identifier/date combination. This is the reason for the concatenations in my code above: I am trying (perhaps unnecessarily) to create unique lookup values so I can communicate between portfolios and returns. For example, to populate dataframe3[0,0], I would look for the concatenated values from portfolios[0,0] and headers[0] (i.e. 9304420131231) in returns['concat'] and return the associated value in returns['MORET'] (i.e. -0.022304). I am stuck on how to use the concatenated values to return my desired data.
Any thoughts are greatly appreciated.
What you are trying to do is much simpler than the way you tried to do it. You can first melt portfolios to flip it and collect all the date columns as rows in a single column, then join it with returns, and finally pivot to get the desired result. This is basically what @djk47463 did in one compound line, and my edited answer serves as a step-by-step breakdown of his.
Let's create your DataFrames to make the answer reproducible.
import pandas as pd
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
# Create df
rawText = StringIO("""
PERMNO FROMDATE MORET
0 93044 20131231 -0.022304
1 79702 20131231 0.012283
2 85751 20131231 -0.016453
3 85576 20131231 0.038766
4 93044 20131010 -0.02
5 79702 20131010 0.01
6 85751 20131010 -0.01
7 85576 20131010 0.03
""")
returns = pd.read_csv(rawText, sep=r"\s+")  # raw string avoids an invalid-escape warning
portfolios = pd.DataFrame({'20131010':[93044, 85751],
'20131231':[85576, 79702]})
Notice that the FROMDATE column of returns consists of numbers, but in portfolios the date columns are strings. We must make them consistent:
returns['FROMDATE'] = returns['FROMDATE'].astype(str)
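Before merging, it can be worth confirming that the date keys really are comparable. The following is a minimal sketch on small stand-in frames that reuse values from the example; the variable name `date_keys_match` is only illustrative and not part of the original answer.

```python
import pandas as pd

# Stand-in data mirroring the example frames above.
returns = pd.DataFrame({
    "PERMNO": [93044, 79702],
    "FROMDATE": [20131231, 20131231],   # parsed as integers, as read_csv would do
    "MORET": [-0.022304, 0.012283],
})
portfolios = pd.DataFrame({"20131231": [93044, 79702]})  # column headers are strings

# Convert the date key to strings so it matches the column labels of portfolios.
returns["FROMDATE"] = returns["FROMDATE"].astype(str)

# Every date header in portfolios should now appear among the FROMDATE values.
date_keys_match = set(portfolios.columns) <= set(returns["FROMDATE"])
print(date_keys_match)
```

If the check prints False, some date header has no counterpart in returns and every cell in that column will come back as NaN after the merge.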
Let's start the solution by melting (i.e. unpivoting) portfolios:
pm = portfolios.melt(var_name='FROMDATE', value_name='PERMNO')
# pm:
FROMDATE PERMNO
0 20131010 93044
1 20131010 85751
2 20131231 85576
3 20131231 79702
Now you want to hold this pm constant and merge returns onto its rows wherever their PERMNOs and FROMDATEs match:
merged = pm.merge(returns, how='left', on=['PERMNO', 'FROMDATE'])
# merged:
FROMDATE PERMNO MORET
0 20131010 93044 -0.020000
1 20131010 85751 -0.010000
2 20131231 85576 0.038766
3 20131231 79702 0.012283
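With a left merge, identifier/date pairs that have no counterpart in returns simply get NaN in MORET. One way to list them explicitly is merge's `indicator` option. This is a sketch on a small stand-in for pm, where the identifier 11111 is a made-up value added purely to force a miss.

```python
import pandas as pd

# Stand-in frames; 11111 is a hypothetical identifier with no return on record.
pm = pd.DataFrame({"FROMDATE": ["20131010", "20131231", "20131231"],
                   "PERMNO": [93044, 85576, 11111]})
returns = pd.DataFrame({"PERMNO": [93044, 85576],
                        "FROMDATE": ["20131010", "20131231"],
                        "MORET": [-0.02, 0.038766]})

# indicator=True adds a "_merge" column recording where each row came from.
checked = pm.merge(returns, how="left", on=["PERMNO", "FROMDATE"], indicator=True)

# Rows tagged "left_only" found no matching return, so their MORET is NaN.
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched[["FROMDATE", "PERMNO"]])
```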
Remember that we melted (unpivoted) portfolios at the beginning? We should pivot this result to give it the shape of portfolios:
final = merged.pivot(index='PERMNO', columns='FROMDATE', values='MORET').reset_index()
# final:
FROMDATE PERMNO 20131010 20131231
0 79702 NaN 0.012283
1 85576 NaN 0.038766
2 85751 -0.01 NaN
3 93044 -0.02 NaN
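Note that pivoting on PERMNO indexes the result by identifier rather than by original row position, so the output above does not line up cell for cell with portfolios. One possible way to keep the original layout, sketched here with the same example data, is to carry the row position through the melt as an id column and pivot on that instead. The column name `row` is chosen only for illustration.

```python
import pandas as pd

returns = pd.DataFrame({
    "PERMNO": [93044, 79702, 85751, 85576] * 2,
    "FROMDATE": ["20131231"] * 4 + ["20131010"] * 4,
    "MORET": [-0.022304, 0.012283, -0.016453, 0.038766, -0.02, 0.01, -0.01, 0.03],
})
portfolios = pd.DataFrame({"20131010": [93044, 85751],
                           "20131231": [85576, 79702]})

# Keep the original row position as an explicit column before melting.
long = (portfolios.rename_axis("row").reset_index()
        .melt(id_vars="row", var_name="FROMDATE", value_name="PERMNO"))

merged = long.merge(returns, how="left", on=["PERMNO", "FROMDATE"])

# Pivot on the row position so each return lands where its identifier was.
dataframe3 = merged.pivot(index="row", columns="FROMDATE", values="MORET")
dataframe3 = dataframe3[portfolios.columns]  # restore the original column order
print(dataframe3)
```

The result has the same number of rows and columns as portfolios, with each identifier replaced by its MORET for that column's date.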
IIUC:
Use melt so that we can merge values from returns on the desired columns, then use pivot to reshape the data back, as seen below.
portfolios.columns = portfolios.columns.astype(int)
newdf = (portfolios.reset_index()
         .melt(id_vars='index', var_name='FROMDATE', value_name='PERMNO')
         .merge(returns, on=['FROMDATE', 'PERMNO'], how='left')
         .pivot(index='index', columns='FROMDATE', values='MORET'))
Which returns the DataFrame below:
FROMDATE 20130630 20130731 20130831 20130930 20131031 20131130 20131231
index
0 NaN NaN NaN NaN NaN NaN -0.022304
1 NaN NaN NaN NaN NaN NaN 0.012283
2 NaN NaN NaN NaN NaN NaN -0.016453
Sort columns
newdf.loc[:,newdf.columns.sort_values(ascending=False)]
The typical way to do a vlookup in Python is to create a series with what would be your left column in the index, and then slice that series by the lookup value. The NaNs complicate it a little. We'll make a series from returns by using the set_index method to set PERMNO as the index of the dataframe, and then slicing by the column name to isolate the MORET column as a series.
import numpy as np

lookupseries = returns.set_index('PERMNO')['MORET']

def lookup(x):
    # Return the MORET for identifier x, or NaN when x is NaN or not found
    try:
        return lookupseries[x]
    except KeyError:
        return np.nan

newdf = portfolios.copy()
for c in newdf.columns:
    newdf[c] = newdf[c].apply(lookup)
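One caveat with indexing on PERMNO alone: if an identifier appears under more than one FROMDATE in returns, the lookup no longer yields a single value. A possible variant, assuming each (PERMNO, FROMDATE) pair is unique in returns, keys the lookup on both fields and passes each cell together with its column header. This sketch swaps the series for a plain dict, and the names `moret_by_key` and `lookup_moret` are hypothetical.

```python
import numpy as np
import pandas as pd

# Stand-in data reusing values from the examples above.
returns = pd.DataFrame({
    "PERMNO": [93044, 79702, 93044, 79702],
    "FROMDATE": [20131231, 20131231, 20131010, 20131010],
    "MORET": [-0.022304, 0.012283, -0.02, 0.01],
})
portfolios = pd.DataFrame({"20131231": [93044.0, 79702.0],
                           "20131010": [79702.0, np.nan]})

# Build a dict keyed by (identifier, date) so lookups are unambiguous.
moret_by_key = {
    (int(p), str(d)): m
    for p, d, m in zip(returns["PERMNO"], returns["FROMDATE"], returns["MORET"])
}

def lookup_moret(permno, date):
    """Return MORET for the identifier/date pair, or NaN if missing or absent."""
    if pd.isna(permno):
        return np.nan
    return moret_by_key.get((int(permno), str(date)), np.nan)

# Rebuild the frame column by column, pairing each cell with its date header.
dataframe3 = pd.DataFrame(
    {date: [lookup_moret(p, date) for p in portfolios[date]]
     for date in portfolios.columns},
    index=portfolios.index,
)
print(dataframe3)
```

Normalising the key to `(int, str)` on both sides sidesteps the float identifiers and string headers in portfolios without changing either dataframe.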