Python: merge two pandas data frames based on partial string match
I'm new to Python, and I am having a lot of trouble joining two pandas data frames, because the merge should be based on a partial string match. More specifically:
I have a dataframe called df that looks like this:
{ "writtenAt":"2015-01-01T18:31:01+00:00", "content":" India's banks will ramp up sales of bonds that act as capital buffers in 2015" }
There are about 10,000 rows like the one above.
Now, I have another dataframe called compNames, which looks like this:
{ "ticker":"A", "name":"Agilent Technologies Inc.", "keyword":"Agilent" }
The compNames dataframe has about 500 rows.
I am trying to assign a ticker value from compNames to the matching entry of df by the following mechanism:
Check whether any item from the entire column compNames['keyword'] is contained in an entry of df['content'].
If there is a match, return the matching word as a separate column of the df dataframe (e.g. df['matchedName']).
If there are multiple matches, store a list of the matching words for the corresponding entry of df['content'].
Finally, join df and compNames using df['matchedName'] and compNames['keyword'] as the key variables.
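The four steps above can be sketched roughly as follows. This is only a minimal illustration on made-up data: the two toy frames, the second company row, and the names keywords and merged are assumptions, not part of the original data. It uses a plain case-sensitive substring check per row, then DataFrame.explode to get one row per matched keyword before merging.

```python
import pandas as pd

# Toy stand-ins for the two frames described above (values are illustrative).
df = pd.DataFrame({
    "writtenAt": ["2015-01-01T18:31:01+00:00", "2015-01-02T09:00:00+00:00"],
    "content": [
        "Agilent and Apple both reported earnings",
        "India's banks will ramp up sales of bonds",
    ],
})
compNames = pd.DataFrame({
    "ticker": ["A", "AAPL"],
    "name": ["Agilent Technologies Inc.", "Apple Inc."],
    "keyword": ["Agilent", "Apple"],
})

keywords = compNames["keyword"].tolist()

# Steps 1-3: for each article, collect every keyword found as a substring.
df["matchedName"] = df["content"].apply(
    lambda text: [kw for kw in keywords if kw in text]
)

# Step 4: one row per (article, keyword) pair, then join on the keyword.
# Articles with no match explode to NaN and are dropped before the merge.
merged = (
    df.explode("matchedName")
      .dropna(subset=["matchedName"])
      .merge(compNames, left_on="matchedName", right_on="keyword", how="inner")
)
```

An article that mentions several companies ends up as several rows in merged, one per ticker, which is usually easier to work with than a list-valued key column.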
What I have so far is:
# Load select company names
compNames = pd.read_csv("compNameList_LARA.txt")
compList = '|'.join(compNames['keyword'].tolist())
df['compMatch'] = df.content.str.contains(compList)
# drop unmatched articles
df = df[df['compMatch']==True]
# assign firm names
df['matchedName'] = df['content'].apply(lambda x: [x for x in compNames['keyword'].tolist() if x in df['content']])
However, when I do this, I get an empty list for df['matchedName'].
Could you help me figure out what went wrong? Many many thanks!!
-Jin
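A likely reason for the empty lists: in the last line of the code, x in df['content'] tests membership against the index labels of the Series, not against the text of the current row, and the comprehension variable x also shadows the lambda's argument. A small sketch on made-up data (the Series s and the keywords list are illustrative only):

```python
import pandas as pd

s = pd.Series(["Agilent beat estimates", "bond sales rise"])

# `in` on a Series checks the index labels (0 and 1 here), not the values.
label_hit = 0 in s                          # 0 is an index label
value_hit = "Agilent beat estimates" in s   # a value, but not a label

keywords = ["Agilent", "Apple"]

# Testing each keyword against the row's own string behaves as intended.
matches = s.apply(lambda text: [kw for kw in keywords if kw in text])
```

Using distinct names for the row text and the keyword (text and kw here) avoids the shadowing problem.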
Figured it out. I just needed to do:
df['content'] = df['content'].str.lower().str.split()
df['matchedName'] = df['content'].apply(lambda x: [item for item in x if item in compNames['keyword'].tolist()])
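Note that this lowercases and overwrites df['content'], so it only finds keywords that are themselves lowercase single words, and a token with punctuation attached (such as "agilent,") will not match. If the keywords are mixed case, as in the sample row ("Agilent"), a regex-based variant may be more robust. The following is a sketch under that assumption, with illustrative data; pattern, found and canonical are hypothetical names:

```python
import re
import pandas as pd

df = pd.DataFrame({"content": ["Shares of Agilent, and APPLE, rose", "Bond sales rise"]})
compNames = pd.DataFrame({"ticker": ["A", "AAPL"], "keyword": ["Agilent", "Apple"]})

# Escape each keyword so characters such as "." are matched literally, and
# wrap the alternation in word boundaries to avoid hits inside longer words.
pattern = r"\b(?:" + "|".join(map(re.escape, compNames["keyword"])) + r")\b"

# findall returns the list of matches for each row; content is left untouched.
found = df["content"].str.findall(pattern, flags=re.IGNORECASE)

# Map each hit back to the spelling used in compNames["keyword"] so the
# result can later be joined on that column; duplicates are removed.
canonical = {kw.lower(): kw for kw in compNames["keyword"]}
df["matchedName"] = found.apply(
    lambda hits: sorted({canonical[h.lower()] for h in hits})
)
```

The word boundaries are a design choice: they stop a short keyword from matching inside an unrelated longer word, at the cost of missing inflected forms.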