Python: merge two pandas data frames based on a partial string match

I'm new to Python, and I am having a lot of trouble joining two pandas data frames, because the merge should be based on a partial string match. More specifically:

I have a dataframe called df that looks like this:

{ "writtenAt":"2015-01-01T18:31:01+00:00", "content":" India’s banks will ramp up sales of bonds that act as capital buffers in 2015" }

There are about 10,000 rows like the one above.

Now, I have another dataframe called compNames, which looks like this:

{ "ticker":"A", "name":"Agilent Technologies Inc.", "keyword":"Agilent" }

I have about 500 rows for the compNames dataframe.

I am trying to assign a ticker value from compNames to the matching entry of df by the following mechanism:

  1. check if any item from the entire column compNames['keyword'] is contained in an entry of df['content']

  2. if there is a match, then return the matching word as a separate column of the df dataframe (e.g. df['matchedName'])

  3. if there are multiple matches, then store a list of the matching words for the corresponding entry of df['content']

  4. Finally, join df and compNames by using df['matchedName'] and compNames['keyword'] as my key variables

What I have so far is:

import pandas as pd

# Load select company names
compNames = pd.read_csv("compNameList_LARA.txt")
compList = '|'.join(compNames['keyword'].tolist())
df['compMatch'] = df.content.str.contains(compList)

# drop unmatched articles
df = df[df['compMatch']==True]

# assign firm names
df['matchedName'] = df['content'].apply(lambda x: [x for x in compNames['keyword'].tolist() if x in df['content']])

However, when I do this, I get an empty list for df['matchedName'].

Could you help me figure out what went wrong? Many many thanks!!

-Jin
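
A likely cause of the empty lists, shown as a small sketch with made-up sample rows: inside the lambda, `x in df['content']` tests each keyword against the whole Series rather than the row's own text, and `in` on a pandas Series checks the index labels, not the values. The comprehension variable `x` also shadows the lambda's argument, so the row text is never used. The names `keywords`, `text` and `kw` below are illustrative, not from the original code.

```python
import pandas as pd

# Made-up sample rows standing in for the real data.
df = pd.DataFrame({"content": [
    "Agilent and Apple report earnings",
    "India's banks will ramp up sales of bonds",
]})
keywords = ["Agilent", "Apple"]

# `in` on a Series looks at the index labels (0, 1), so a keyword string is never found.
series_membership = "Agilent" in df["content"]

# Test each keyword against the row's own text instead.
df["matchedName"] = df["content"].apply(
    lambda text: [kw for kw in keywords if kw in text]
)
print(series_membership)
print(df["matchedName"].tolist())
```

Note that `kw in text` is a plain substring test, so a short keyword can also match inside a longer word.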

Figured it out. I just needed to do:

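# Lowercase the keywords too, so they match the lowercased, tokenized text.
keywords = set(compNames['keyword'].str.lower())
df['content'] = df['content'].str.lower().str.split()
df['matchedName'] = df['content'].apply(lambda x: [item for item in x if item in keywords])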
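
The snippet above stops at building df['matchedName']; the join from step 4 of the question is not shown. One possible way to finish it, as a rough sketch rather than the poster's actual method: collect the matches with a single case-insensitive regex, give each match its own row with `explode` (this assumes pandas 0.25 or later), and merge on a lowercased key. The sample rows, the second company, and the helper column `key` are made up for illustration.

```python
import re
import pandas as pd

# Made-up sample rows standing in for the real data.
df = pd.DataFrame({
    "writtenAt": ["2015-01-01T18:31:01+00:00", "2015-01-02T09:00:00+00:00"],
    "content": ["Agilent and Apple shares rose",
                "India's banks will ramp up sales of bonds"],
})
compNames = pd.DataFrame({
    "ticker": ["A", "AAPL"],
    "name": ["Agilent Technologies Inc.", "Apple Inc."],
    "keyword": ["Agilent", "Apple"],
})

# Steps 1-3: one regex with word boundaries; findall returns a list per row.
pattern = r"\b(?:" + "|".join(re.escape(k) for k in compNames["keyword"]) + r")\b"
df["matchedName"] = df["content"].str.findall(pattern, flags=re.IGNORECASE)

# Drop rows with no match, then give each matched keyword its own row.
matched = df[df["matchedName"].str.len() > 0].explode("matchedName")

# Step 4: join on a lowercased key so that case differences do not block the merge.
matched["key"] = matched["matchedName"].str.lower()
compNames["key"] = compNames["keyword"].str.lower()
result = matched.merge(compNames, on="key", how="left")
print(result[["writtenAt", "matchedName", "ticker"]])
```

Unlike splitting on whitespace, the word-boundary regex still matches a keyword that is followed by punctuation, and `re.escape` keeps characters such as `.` in a keyword from being read as regex syntax.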
