
Python: merge two pandas data frames based on a partial string match

I'm new to Python, and I am having a lot of trouble joining two pandas data frames, because the merge should be based on a partial string match. More specifically:

I have a dataframe called df that looks like this:

{ "writtenAt":"2015-01-01T18:31:01+00:00", "content":" India's banks will ramp up sales of bonds that act as capital buffers in 2015" }

There are about 10,000 rows that look like the one above.

Now, I have another dataframe called compNames, which looks like this:

{ "ticker":"A", "name":"Agilent Technologies Inc.", "keyword":"Agilent" }

The compNames dataframe has about 500 rows.
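For anyone who wants to reproduce the setup, here is a minimal sketch that builds one-row stand-ins for the two frames from the sample rows shown above. The real frames are loaded from files, so the construction below is only an assumption about their shape.

```python
import pandas as pd

# One-row stand-in for df, copied from the sample article row above.
df = pd.DataFrame([
    {
        "writtenAt": "2015-01-01T18:31:01+00:00",
        "content": " India's banks will ramp up sales of bonds that act as capital buffers in 2015",
    }
])

# One-row stand-in for compNames, copied from the sample company row above.
compNames = pd.DataFrame([
    {"ticker": "A", "name": "Agilent Technologies Inc.", "keyword": "Agilent"}
])
```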

I am trying to assign a ticker value from compNames to the matching entry of df by the following mechanism:

  1. Check whether any item from the entire column compNames['keyword'] is contained in an entry of df['content'].

  2. If there is a match, return the matching word as a separate column of the df dataframe (e.g. df['matchedName']).

  3. If there are multiple matches, create a list of the matching words for the corresponding entry of df['content'].

  4. Finally, join df and compNames using df['matchedName'] and compNames['keyword'] as the key variables.

What I have so far is:

import pandas as pd

# Load select company names
compNames = pd.read_csv("compNameList_LARA.txt")
compList = '|'.join(compNames['keyword'].tolist())
df['compMatch'] = df.content.str.contains(compList)

# drop unmatched articles
df = df[df['compMatch']==True]

# assign firm names
df['matchedName'] = df['content'].apply(lambda x: [x for x in compNames['keyword'].tolist() if x in df['content']])

However, when I do this, I get an empty list for df['matchedName'].

Could you help me figure out what went wrong? Many many thanks!!

-Jin
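A note on why the attempt above likely comes back empty: the condition `x in df['content']` tests the whole Series rather than the current row, and `in` on a pandas Series checks the index labels, not the values. The comprehension variable x also shadows the lambda argument, so the row text is never looked at. The sketch below shows this on a small made-up Series (the sample strings and the names content and keywords are assumptions for illustration).

```python
import pandas as pd

# Made-up stand-in for df['content']
content = pd.Series(["Agilent shares rose", "Bond sales ramp up"])
keywords = ["Agilent", "Bond"]

# `in` on a Series looks at the index labels (0 and 1 here), not at the strings.
label_hit = 0 in content                       # an index label
value_hit = "Agilent shares rose" in content   # a value, not a label

# Same pattern as the attempt: the condition tests the whole Series,
# so no keyword ever passes the filter.
attempt = content.apply(lambda x: [x for x in keywords if x in content])

# Testing each keyword against the row's own text instead.
fixed = content.apply(lambda text: [kw for kw in keywords if kw in text])
```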

Figured it out. I just needed to do:

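# Lowercase each article and split it into a list of words (this overwrites df['content'])
df['content'] = df['content'].str.lower().str.split()
# Keep the words found in the keyword column (the comparison is case-sensitive, so the keywords must be lowercase too)
df['matchedName'] = df['content'].apply(lambda x: [item for item in x if item in compNames['keyword'].tolist()])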
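For completeness, here is a rough sketch of how all four steps from the question, including the final join, might be done without overwriting df['content']. It is not the approach used above but a variation: it matches keywords case-insensitively with a regular expression, then uses explode so that rows with several matches can be merged. The sample rows (including the "Apple" / "AAPL" entry) and the names canonical, pattern, matched and merged are made up for illustration.

```python
import re
import pandas as pd

# Made-up sample data; the "Apple" row is hypothetical and only there to show multiple matches.
df = pd.DataFrame({
    "content": [
        "Agilent and Apple both reported earnings",
        "India's banks will ramp up sales of bonds",
        "apple's suppliers rallied",
    ]
})
compNames = pd.DataFrame({
    "ticker": ["A", "AAPL"],
    "name": ["Agilent Technologies Inc.", "Apple Inc."],
    "keyword": ["Agilent", "Apple"],
})

# Map lowercase keyword -> keyword as spelled in compNames, so hits can be joined back.
canonical = {kw.lower(): kw for kw in compNames["keyword"]}

# One alternation pattern; re.escape guards against regex metacharacters in keywords,
# and \b keeps a keyword from matching inside a longer word.
pattern = r"\b(?:" + "|".join(re.escape(kw) for kw in canonical) + r")\b"

# Steps 1-3: a sorted, de-duplicated list of matched keywords for every row.
df["matchedName"] = (
    df["content"]
    .str.findall(pattern, flags=re.IGNORECASE)
    .apply(lambda hits: sorted({canonical[h.lower()] for h in hits}))
)

# Drop articles without any match.
matched = df[df["matchedName"].str.len() > 0]

# Step 4: one row per (article, keyword) pair, then join on the keyword.
merged = (
    matched.explode("matchedName")
    .merge(compNames, left_on="matchedName", right_on="keyword", how="left")
)
```

With roughly 500 keywords, a single compiled pattern avoids looping over the whole keyword list for every word of every article. Keeping matchedName as a list and exploding only for the join keeps one row per article in df itself.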
