Named entity replacement - Pandas DataFrame with text column - preprocessing

Question

I have a dataframe with a column of sentences (text). I want to perform named entity replacement: I have a list whose elements are stock information:

stocks = [
{"symbol": "GS", "company": "Goldman Sachs", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "JPM", "company": "JPMorgan Chase", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "TRV", "company": "The Travelers Companies", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "V", "company": "Visa Inc.", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "AMGN", "company": "Amgen", "index": "Dow Jones", "sector": "Healthcare"},
{"symbol": "JNJ", "company": "Johnson & Johnson", "index": "Dow Jones", "sector": "Healthcare"}
]

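I want to find each 'symbol' and 'company' in the sentences of my dataframe, replacing symbols with '<TKR>' and companies with '<CMPY>'. The function must be applied to every row.

I am looking for a function that takes a dataframe with tokenized text and returns the processed text. It is important to match the whole company name, not just one element of the name. As for symbols, I know this is a bit harder, because a "V" (Visa's symbol) is easy to find in ordinary text, but I am here to hear some good workarounds.

Let's start with an example: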

print(dataframe['text'])

Output:

0  [GS is the main company of Dow Jones]
1  [Once again Visa surprises all]*
2  [Johnson & Johnson's vaccine is the best one]

I want a new column with the following result:

0  [<TKR> is the main company of Dow Jones]
1  [Once again <CMPY> surprises all]*
2  [<CMPY>'s vaccine is the best one]

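Row 1 --> This one is tricky, because the company's real name is "Visa Inc.", not just Visa... I really don't know how to handle it.

I don't know whether working with tokenized sentences is better, because in that case I would also need to tokenize the "companies", such as Goldman Sachs.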

Answer 1

You can use

import pandas as pd
import re

stocks = [
{"symbol": "GS", "company": "Goldman Sachs", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "JPM", "company": "JPMorgan Chase", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "TRV", "company": "The Travelers Companies", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "V", "company": "Visa Inc.", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "AMGN", "company": "Amgen", "index": "Dow Jones", "sector": "Healthcare"},
{"symbol": "JNJ", "company": "Johnson & Johnson", "index": "Dow Jones", "sector": "Healthcare"}
]

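def process_term(term):
    # Build a pattern where each later word is optional, but only matched
    # if all the words before it matched: "A B C" -> A(?:\s+B(?:\s+C)?)?
    t = [re.escape(x) for x in term.split()]
    first = t[0]
    if len(t) > 1:
        first = first + r"(?:\s+{}{}".format(r"(?:\s+".join(t[1:]), ")?" * (len(t)-1))
    return first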

dataframe = pd.DataFrame({'text':['GS is the main company of Dow Jones', 'Once again Visa surprises all', "Johnson & Johnson's vaccine is the best one"]})
rx_symbol = r"\b(?:{})\b".format("|".join([re.escape(x["symbol"]) for x in stocks]))
rx_company = r"\b(?:{})(?!\w)".format("|".join(sorted([process_term(x["company"]) for x in stocks], key=len, reverse=True)))
dataframe['new_text'] = dataframe['text'].str.replace(rx_symbol, r'<TKR>', regex=True)
dataframe['new_text'] = dataframe['new_text'].str.replace(rx_company, r'<CMPY>', regex=True)
print(dataframe)
# =>                                           text                                  new_text
# => 0          GS is the main company of Dow Jones  <TKR> is the main company of Dow Jones
# => 1                Once again Visa surprises all         Once again <CMPY> surprises all
# => 2  Johnson & Johnson's vaccine is the best one        <CMPY>'s vaccine is the best one
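
The question asks for a function that can be applied to every row. As a rough sketch, the two-step replacement above could be wrapped like this. The names build_patterns, prefix_pattern and replace_entities, the column and new_column parameters, and the shortened example_stocks list are illustrative assumptions, not part of the original answer; symbols are additionally passed through re.escape here.

```python
import re

import pandas as pd

# Shortened, illustrative copy of the stocks list (assumption for this sketch).
example_stocks = [
    {"symbol": "GS", "company": "Goldman Sachs"},
    {"symbol": "V", "company": "Visa Inc."},
    {"symbol": "JNJ", "company": "Johnson & Johnson"},
]


def prefix_pattern(name):
    """Turn "A B C" into A(?:\\s+B(?:\\s+C)?)? so A, "A B" and "A B C" match."""
    words = [re.escape(w) for w in name.split()]
    pattern = words[-1]
    # Wrap from the last word outwards so each later word is optional.
    for word in reversed(words[:-1]):
        pattern = r"{}(?:\s+{})?".format(word, pattern)
    return pattern


def build_patterns(stocks):
    """Return (symbol regex, company regex) as pattern strings."""
    symbols = sorted((re.escape(s["symbol"]) for s in stocks), key=len, reverse=True)
    companies = sorted((prefix_pattern(s["company"]) for s in stocks), key=len, reverse=True)
    rx_symbol = r"\b(?:{})\b".format("|".join(symbols))
    # (?!\w) instead of \b so names ending in "." can still match in full.
    rx_company = r"\b(?:{})(?!\w)".format("|".join(companies))
    return rx_symbol, rx_company


def replace_entities(df, stocks, column="text", new_column="new_text"):
    """Return a copy of df with symbols and company names replaced."""
    rx_symbol, rx_company = build_patterns(stocks)
    out = df.copy()
    out[new_column] = (
        out[column]
        .str.replace(rx_symbol, "<TKR>", regex=True)
        .str.replace(rx_company, "<CMPY>", regex=True)
    )
    return out


df = pd.DataFrame({"text": [
    "GS is the main company of Dow Jones",
    "Once again Visa surprises all",
    "Johnson & Johnson's vaccine is the best one",
]})
result = replace_entities(df, example_stocks)
print(result["new_text"].tolist())
```

The function returns a copy, so the input dataframe is left unchanged.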

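In short:

Create two regexes from the symbol and company data and run two replace operations.
The symbol regex is simple: it looks like \b(?:GS|JPM|TRV|V|AMGN|JNJ)\b and matches any of the alternatives in the group as a whole word.
The company regex follows the approach described in "Regex to match A, AB, ABC, but not AC ("starts with")". It looks like \b(?:The(?:\s+Travelers(?:\s+Companies)?)?|Johnson(?:\s+\&(?:\s+Johnson)?)?|JPMorgan(?:\s+Chase)?|Goldman(?:\s+Sachs)?|Visa(?:\s+Inc\.)?|Amgen)(?!\w) : each company name is escaped with re.escape, and each subsequent word is matched, optionally, only if the preceding word matched. See the regex demo. Note that the right-hand word boundary is set with a (?!\w) lookahead, because \b would prevent a match if the term ends with a non-word character.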
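
To illustrate the two points about the company regex, here is a small sketch that uses only the re module. prefix_pattern is a hypothetical helper that builds the same kind of nested optional pattern as process_term, and the test strings are made up for illustration.

```python
import re


def prefix_pattern(name):
    """Turn "A B C" into A(?:\\s+B(?:\\s+C)?)? (hypothetical helper)."""
    words = [re.escape(w) for w in name.split()]
    pattern = words[-1]
    for word in reversed(words[:-1]):
        pattern = r"{}(?:\s+{})?".format(word, pattern)
    return pattern


# 1) Prefix matching: A, AB and ABC match as a whole, AC does not.
travelers = prefix_pattern("The Travelers Companies")
matches_a = re.fullmatch(travelers, "The") is not None
matches_ab = re.fullmatch(travelers, "The Travelers") is not None
matches_abc = re.fullmatch(travelers, "The Travelers Companies") is not None
matches_ac = re.fullmatch(travelers, "The Companies") is not None

# 2) Right-hand boundary: (?!\w) versus \b for a name ending in ".".
visa = prefix_pattern("Visa Inc.")
with_lookahead = re.compile(r"\b(?:{})(?!\w)".format(visa))
with_boundary = re.compile(r"\b(?:{})\b".format(visa))

text = "Visa Inc. surprises all"
found_lookahead = with_lookahead.search(text).group()
found_boundary = with_boundary.search(text).group()

print(matches_a, matches_ab, matches_abc, matches_ac)
print(repr(found_lookahead), repr(found_boundary))
```

With the trailing \b, the part ending in "." cannot be followed by a word boundary before a space, so the regex falls back to the shorter "Visa" and leaves " Inc." in the text. Note also that the first word on its own matches, so a name starting with a common word such as "The" may replace that word wherever it appears capitalised; whether that is acceptable depends on your data.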


Answer 1: 1 accepted, 2021-03-28 14:01:11
