
Name Entities Replacement - Pandas Dataframe with text column - Preprocessing

I have a dataframe with a text column of sentences. I would like to perform named-entity replacement using a list whose elements hold stock information:

stocks = [
{"symbol": "GS", "company": "Goldman Sachs", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "JPM", "company": "JPMorgan Chase", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "TRV", "company": "The Travelers Companies", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "V", "company": "Visa Inc.", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "AMGN", "company": "Amgen", "index": "Dow Jones", "sector": "Healthcare"},
{"symbol": "JNJ", "company": "Johnson & Johnson", "index": "Dow Jones", "sector": "Healthcare"}
]

I would like to find each 'symbol' and 'company' in my dataframe sentences and replace the symbol with '<TKR>' and the company with '<CMPY>'. The function must be applied to all rows.

I'm looking for a function that receives the dataframe with the tokenized texts and returns the processed one. It's important to match the entire company name, not just one element of it. The symbol is harder, because a bare 'V' (Visa's symbol) can easily appear in ordinary text, so I'd welcome a good workaround.

Let's start with an example:

print(dataframe['text'])

Output:

0  [GS is the main company of Dow Jones]
1  [Once again Visa surprises all]*
2  [Johnson & Johnson's vaccine is the best one]

I would like to have a new column with the following result:

0  [<TKR> is the main company of Dow Jones]
1  [Once again <CMPY> surprises all]*
2  [<CMPY>'s vaccine is the best one] 

Row 1 is the tricky one: the real name of the company is 'Visa Inc.', not just 'Visa', and I don't know how to deal with that.

I also don't know whether it is better to work with tokenized sentences, because in that case I would need to tokenize the company names as well (e.g. "Goldman Sachs").

You can use

import pandas as pd
import re

stocks = [
{"symbol": "GS", "company": "Goldman Sachs", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "JPM", "company": "JPMorgan Chase", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "TRV", "company": "The Travelers Companies", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "V", "company": "Visa Inc.", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "AMGN", "company": "Amgen", "index": "Dow Jones", "sector": "Healthcare"},
{"symbol": "JNJ", "company": "Johnson & Johnson", "index": "Dow Jones", "sector": "Healthcare"}
]

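def process_term(term):
    # Escape every word so regex metacharacters in names (e.g. the "." in "Inc.") match literally
    t = [re.escape(x) for x in term.split()]
    first = t[0]
    if len(t) > 1:
        # Nest each following word in an optional group: w1(?:\s+w2(?:\s+w3)?)?
        first = first + r"(?:\s+{}{}".format(r"(?:\s+".join(t[1:]), ")?" * (len(t)-1))
    return first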

dataframe = pd.DataFrame({'text':['GS is the main company of Dow Jones', 'Once again Visa surprises all', "Johnson & Johnson's vaccine is the best one"]})
# Tickers: any symbol as a whole word
rx_symbol = r"\b(?:{})\b".format("|".join([re.escape(x["symbol"]) for x in stocks]))
# Company names: longest alternatives first; (?!\w) instead of \b so names ending in "." still match
rx_company = r"\b(?:{})(?!\w)".format("|".join(sorted([process_term(x["company"]) for x in stocks], key=len, reverse=True)))
dataframe['new_text'] = dataframe['text'].str.replace(rx_symbol, r'<TKR>', regex=True)
dataframe['new_text'] = dataframe['new_text'].str.replace(rx_company, r'<CMPY>', regex=True)
print(dataframe)
# =>                                           text                                new_text
# => 0          GS is the main company of Dow Jones  <TKR> is the main company of Dow Jones
# => 1                Once again Visa surprises all         Once again <CMPY> surprises all
# => 2  Johnson & Johnson's vaccine is the best one        <CMPY>'s vaccine is the best one
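
The question asks for a reusable function and shows each row as a list holding one sentence. The following is a minimal sketch of how the same two regexes could be wrapped so they work on a plain string or on a list of strings. The names `optional_suffix_pattern`, `SYMBOL_RX`, `COMPANY_RX` and `replace_entities` are hypothetical helpers introduced here for illustration, and the stock list is trimmed to the two keys the replacement needs.

```python
import re

# Trimmed copy of the stock list from the question (only the keys used here).
stocks = [
    {"symbol": "GS", "company": "Goldman Sachs"},
    {"symbol": "JPM", "company": "JPMorgan Chase"},
    {"symbol": "TRV", "company": "The Travelers Companies"},
    {"symbol": "V", "company": "Visa Inc."},
    {"symbol": "AMGN", "company": "Amgen"},
    {"symbol": "JNJ", "company": "Johnson & Johnson"},
]

def optional_suffix_pattern(name):
    """Match the first word of `name` plus as many of the following words as are present."""
    words = [re.escape(w) for w in name.split()]
    pattern = words[-1]
    # Build from the last word inwards: w1(?:\s+w2(?:\s+w3)?)?
    for word in reversed(words[:-1]):
        pattern = r"{}(?:\s+{})?".format(word, pattern)
    return pattern

# Compile once so the patterns are not rebuilt for every row.
SYMBOL_RX = re.compile(
    r"\b(?:{})\b".format("|".join(re.escape(s["symbol"]) for s in stocks))
)
COMPANY_RX = re.compile(
    r"\b(?:{})(?!\w)".format(
        "|".join(
            sorted(
                (optional_suffix_pattern(s["company"]) for s in stocks),
                key=len,
                reverse=True,
            )
        )
    )
)

def replace_entities(cell):
    """Replace tickers and company names in a string, or in each string of a list."""
    if isinstance(cell, list):
        return [replace_entities(item) for item in cell]
    # Same order as the answer: symbols first, then company names.
    return COMPANY_RX.sub("<CMPY>", SYMBOL_RX.sub("<TKR>", cell))
```

Assuming the dataframe from the answer, `dataframe['new_text'] = dataframe['text'].map(replace_entities)` would apply it row by row, whether the cells are strings or lists of strings.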

In short:

  • Create two regexes, one from the symbol data and one from the company data, and run two replace operations.
  • The symbol regex is simple: it looks like \b(?:GS|JPM|TRV|V|AMGN|JNJ)\b and matches any alternative in the group as a whole word.
  • The company regex follows the suffixation approach described in "Regular expression to match A, AB, ABC, but not AC. ("starts with")". It looks like \b(?:The(?:\s+Travelers(?:\s+Companies)?)?|Johnson(?:\s+\&(?:\s+Johnson)?)?|JPMorgan(?:\s+Chase)?|Goldman(?:\s+Sachs)?|Visa(?:\s+Inc\.)?|Amgen)(?!\w) : each company name is escaped with re.escape, and each subsequent word matches (optionally) only if the previous word of the name has matched. Note that the right-hand word boundary is set with a (?!\w) lookahead, since \b would prevent a match when a term ends with a non-word character.
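
To make the last two points concrete, here is a small sketch that runs one hand-written alternative of the company regex against a few made-up sentences, then compares the two right-hand boundaries on a name ending in a period. The helper `first_match` and the sample sentences are assumptions added for illustration.

```python
import re

def first_match(pattern, text):
    """Return the matched substring, or None when the pattern does not match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None

# 1) Optional-suffix pattern for "The Travelers Companies", written out by hand.
travelers = r"\bThe(?:\s+Travelers(?:\s+Companies)?)?(?!\w)"
full = first_match(travelers, "The Travelers Companies raised premiums")
partial = first_match(travelers, "The Travelers raised premiums")
# The first word is mandatory, so a sentence without it does not match.
skipped = first_match(travelers, "Travelers raised premiums")
# Caveat: the first word on its own is also a valid match.
bare = first_match(travelers, "The market fell")

# 2) Right-hand boundary for a name that ends in a non-word character.
# \b needs a word character on one side; "." followed by " " has none.
with_b = first_match(r"\bVisa\s+Inc\.\b", "Visa Inc. surprises all")
# (?!\w) only requires that no word character follows.
with_lookahead = first_match(r"\bVisa\s+Inc\.(?!\w)", "Visa Inc. surprises all")

print(full, partial, skipped, bare, with_b, with_lookahead, sep=" | ")
```

The `bare` case shows a side effect worth knowing about: because only the first word is required, a capitalized "The" anywhere in a sentence would become `<CMPY>`. One possible mitigation, not part of the original answer, is to drop generic leading words such as articles from the names before building the patterns.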
