
Name Entities Replacement - Pandas Dataframe with text column - Preprocessing

I have a dataframe with a column of sentences (text). I want to perform named-entity replacement. I have a list whose elements hold stock information:

stocks = [
{"symbol": "GS", "company": "Goldman Sachs", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "JPM", "company": "JPMorgan Chase", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "TRV", "company": "The Travelers Companies", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "V", "company": "Visa Inc.", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "AMGN", "company": "Amgen", "index": "Dow Jones", "sector": "Healthcare"},
{"symbol": "JNJ", "company": "Johnson & Johnson", "index": "Dow Jones", "sector": "Healthcare"}
]

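I want to find the 'symbol' and 'company' values in the sentences of my dataframe, replacing each symbol with '<TKR>' and each company with '<CMPY>'. The function must be applied to all rows.

I am looking for a function that receives a dataframe with tokenized text and returns the processed text. It is important to match the whole company name, not just one element of the name. The symbols are harder, I know, because a bare "V" (Visa's symbol) can easily turn up in ordinary text, but I am here to hear some good workarounds.

Let's start with an example: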

print(dataframe['text'])

Output:

0  [GS is the main company of Dow Jones]
1  [Once again Visa surprises all]*
2  [Johnson & Johnson's vaccine is the best one]

I would like a new column with the following result:

0  [<TKR> is the main company of Dow Jones]
1  [Once again <CMPY> surprises all]*
2  [<CMPY>'s vaccine is the best one] 

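Row 1 --> the tricky one, because the company's real name is "Visa Inc.", not just "Visa"... I really don't know how to handle it.

I don't know whether it is better to work with tokenized sentences, because in that case I would also need to tokenize the "company" values, such as Goldman Sachs.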

You can use

import pandas as pd
import re

stocks = [
{"symbol": "GS", "company": "Goldman Sachs", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "JPM", "company": "JPMorgan Chase", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "TRV", "company": "The Travelers Companies", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "V", "company": "Visa Inc.", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "AMGN", "company": "Amgen", "index": "Dow Jones", "sector": "Healthcare"},
{"symbol": "JNJ", "company": "Johnson & Johnson", "index": "Dow Jones", "sector": "Healthcare"}
]

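def process_term(term):
    # Build a pattern that matches the first word of the name, optionally
    # followed by the remaining words in order (A, A B, A B C, but not A C).
    t = [re.escape(x) for x in term.split()]  # escape every word, including the first
    first = t[0]
    if len(t) > 1:
        # Raw strings keep "\s" from being read as a string escape sequence
        first = first + r"(?:\s+{}{}".format(r"(?:\s+".join(t[1:]), ")?" * (len(t)-1))
    return first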

dataframe = pd.DataFrame({'text':['GS is the main company of Dow Jones', 'Once again Visa surprises all', "Johnson & Johnson's vaccine is the best one"]})
rx_symbol = r"\b(?:{})\b".format("|".join([x["symbol"] for x in stocks]))
rx_company = r"\b(?:{})(?!\w)".format("|".join(sorted([process_term(x["company"]) for x in stocks], key=len, reverse=True)))
dataframe['new_text'] = dataframe['text'].str.replace(rx_symbol, r'<TKR>', regex=True)
dataframe['new_text'] = dataframe['new_text'].str.replace(rx_company, r'<CMPY>', regex=True)
print(dataframe)
# =>                                           text                                new_text
# => 0          GS is the main company of Dow Jones  <TKR> is the main company of Dow Jones
# => 1                Once again Visa surprises all         Once again <CMPY> surprises all
# => 2  Johnson & Johnson's vaccine is the best one        <CMPY>'s vaccine is the best one
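
Since the question asks for a function, the two replace calls above could be wrapped as in the sketch below. This is only one possible arrangement: the names build_prefix_pattern and replace_entities and the parameters column, ticker_tag and company_tag are made up for illustration, and the stock list is shortened to three entries to keep the example small.

```python
import re
import pandas as pd

# Shortened sample list; the full list from the question works the same way.
stocks = [
    {"symbol": "GS", "company": "Goldman Sachs"},
    {"symbol": "V", "company": "Visa Inc."},
    {"symbol": "JNJ", "company": "Johnson & Johnson"},
]

def build_prefix_pattern(name):
    """Match the first word of `name`, optionally followed by its later words in order."""
    words = [re.escape(w) for w in name.split()]
    pattern = words[0]
    if len(words) > 1:
        # Nest one optional group per following word, then close them all.
        pattern += r"(?:\s+" + r"(?:\s+".join(words[1:]) + ")?" * (len(words) - 1)
    return pattern

def replace_entities(df, stocks, column="text", ticker_tag="<TKR>", company_tag="<CMPY>"):
    """Return a Series with symbols and company names replaced by tags (hypothetical helper)."""
    symbols = sorted((re.escape(s["symbol"]) for s in stocks), key=len, reverse=True)
    companies = sorted((build_prefix_pattern(s["company"]) for s in stocks), key=len, reverse=True)
    rx_symbol = r"\b(?:{})\b".format("|".join(symbols))
    rx_company = r"\b(?:{})(?!\w)".format("|".join(companies))
    replaced = df[column].str.replace(rx_symbol, ticker_tag, regex=True)
    return replaced.str.replace(rx_company, company_tag, regex=True)

df = pd.DataFrame({"text": [
    "GS is the main company of Dow Jones",
    "Once again Visa surprises all",
    "Johnson & Johnson's vaccine is the best one",
]})
df["new_text"] = replace_entities(df, stocks)
print(df["new_text"].tolist())
```

The function returns a new Series rather than modifying the dataframe, so the caller decides which column receives the result.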

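In short:

  • Create two regexes from the symbol and company data and run two replace operations.
  • The symbol regex is simple. It looks like \b(?:GS|JPM|TRV|V|AMGN|JNJ)\b and matches any of the alternatives in the group as a whole word.
  • The company regex follows the approach described in "Regular expression to match A, AB, ABC, but not AC ("starts with")". It looks like \b(?:The(?:\s+Travelers(?:\s+Companies)?)?|Johnson(?:\s+\&(?:\s+Johnson)?)?|JPMorgan(?:\s+Chase)?|Goldman(?:\s+Sachs)?|Visa(?:\s+Inc\.)?|Amgen)(?!\w) : every word of a company name is passed through re.escape, and each subsequent word is matched, optionally, only if the preceding word matched. Note that the right-hand word boundary is a (?!\w) lookahead, because \b would prevent the full match when the term ends with a non-word character.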
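
The behaviour described in the list above can be checked in isolation with the standard re module. The following is a small sketch, not part of the original answer; the sample sentences are invented for the purpose, and the patterns are copied from the company regex shown above.

```python
import re

# Prefix pattern for a single company name, taken from the regex above.
prefix = re.compile(r"The(?:\s+Travelers(?:\s+Companies)?)?")

# A, A B and A B C match as a whole; A C does not.
matches = {s: bool(prefix.fullmatch(s)) for s in [
    "The", "The Travelers", "The Travelers Companies", "The Companies"]}

# Right-hand boundary: \b versus (?!\w) for a name that ends in ".".
sentence = "Visa Inc. surprises all"  # invented sample sentence
with_b = re.sub(r"\bVisa(?:\s+Inc\.)?\b", "<CMPY>", sentence)
with_lookahead = re.sub(r"\bVisa(?:\s+Inc\.)?(?!\w)", "<CMPY>", sentence)

# Caveat: the first word alone is enough to match, so a bare "The" is replaced too.
caveat = re.sub(r"\b(?:The(?:\s+Travelers(?:\s+Companies)?)?)(?!\w)",
                "<CMPY>", "The market rallied")

print(matches)
print(with_b)          # <CMPY> Inc. surprises all
print(with_lookahead)  # <CMPY> surprises all
print(caveat)          # <CMPY> market rallied
```

With \b on the right, the regex backtracks and consumes only "Visa", leaving " Inc." behind, whereas the lookahead lets the full "Visa Inc." match. The last line shows a side effect worth keeping in mind: any name whose first word is a common capitalised word, such as "The", will also be tagged when that word appears on its own.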

