Python: merging two pandas DataFrames based on a partial string match

Question

I'm new to Python, and I am having a lot of trouble joining two pandas DataFrames, because the merge should be based on a partial string match. More specifically:

I have a DataFrame called df whose rows look like this:

{ "writtenAt":"2015-01-01T18:31:01+00:00", "content":" India’s banks will ramp up sales of bonds that act as capital buffers in 2015" }

There are about 10,000 rows like the one above.

I also have another DataFrame called compNames, whose rows look like this:

{ "ticker":"A", "name":"Agilent Technologies Inc.", "keyword":"Agilent" }

The compNames DataFrame has about 500 rows.

I am trying to assign a ticker value from compNames to the matching entry of df by the following mechanism:

Check whether any item from the entire column compNames['keyword'] is contained in an entry of df['content'].
If there is a match, return the matching word in a separate column of df (e.g. df['matchedName']).
If there are multiple matches, create a list of the matching words for the corresponding entry of df['content'].
Finally, join df and compNames using df['matchedName'] and compNames['keyword'] as the key variables.
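The four steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the sample rows are made up, matching is case-sensitive substring matching, and `DataFrame.explode` requires pandas 0.25 or later.

```python
import re
import pandas as pd

# Hypothetical sample rows standing in for the real df and compNames.
df = pd.DataFrame({
    "writtenAt": ["2015-01-01T18:31:01+00:00",
                  "2015-01-02T09:00:00+00:00",
                  "2015-01-03T10:00:00+00:00"],
    "content": ["Agilent and Apple both reported earnings today",
                "India's banks will ramp up sales of bonds",
                "Apple unveils a new phone"],
})
compNames = pd.DataFrame({
    "ticker": ["A", "AAPL"],
    "name": ["Agilent Technologies Inc.", "Apple Inc."],
    "keyword": ["Agilent", "Apple"],
})

# Steps 1-3: build one alternation pattern from all keywords.
# re.escape guards against keywords containing regex metacharacters.
pattern = "|".join(re.escape(k) for k in compNames["keyword"])

# findall returns every matched keyword per row (an empty list if none);
# dict.fromkeys drops repeated hits while keeping their order.
df["matchedName"] = (
    df["content"].str.findall(pattern).apply(lambda m: list(dict.fromkeys(m)))
)

# Step 4: one row per (article, keyword) pair, then join on the keyword.
# explode turns an empty list into NaN, so those rows are dropped first.
matched = df.explode("matchedName").dropna(subset=["matchedName"])
merged = matched.merge(
    compNames, left_on="matchedName", right_on="keyword", how="inner"
)
```

Exploding before the merge is one way to handle articles that mention several companies: each article then appears once per matched ticker in `merged`.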

What I have so far is:

# Load select company names
compNames = pd.read_csv("compNameList_LARA.txt")
compList = '|'.join(compNames['keyword'].tolist())
df['compMatch'] = df.content.str.contains(compList)

# drop unmatched articles
df = df[df['compMatch']==True]

# assign firm names
df['matchedName'] = df['content'].apply(lambda x: [x for x in   compNames['keyword'].tolist() if x in df['content']])

However, when I do this, I get an empty list in every row of df['matchedName'].

Could you help me figure out what went wrong? Many thanks!

-Jin

Answer 1

Figured it out. I just needed to do:

df['content'] = df['content'].str.lower().str.split()
df['matchedName'] = df['content'].apply(lambda x: [item for item in x if item in compNames['keyword'].tolist()])
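For anyone hitting the same empty lists: a likely cause in the question's code is the test `x in df['content']`. Inside the list comprehension, `x` is a keyword (it shadows the lambda argument), and `in` applied to a pandas Series checks the index labels rather than the values, so the condition is never true. A minimal reproduction, using a made-up one-row Series and keyword list:

```python
import pandas as pd

# Hypothetical stand-ins for df['content'] and compNames['keyword'].
content = pd.Series(["Agilent reported earnings"])
keywords = ["Agilent", "Apple"]

# `in` on a Series looks at the index labels (here just 0), not the text.
print("Agilent" in content)  # False
print(0 in content)          # True

# Same shape as the question's code: the keyword is tested against the
# whole Series, so every row ends up with an empty list.
broken = content.apply(lambda x: [x for x in keywords if x in content])

# Testing each keyword against the row's own text gives the matches.
fixed = content.apply(lambda text: [k for k in keywords if k in text])
```

Note that the two-line fix above lowercases and splits the text first, so it only finds keywords that are themselves lowercase, consist of a single word, and appear without attached punctuation. It also overwrites df['content'] with lists of tokens. The substring test in `fixed` avoids those restrictions but is case-sensitive.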
