Named Entity Replacement - Pandas DataFrame with text column - Preprocessing
I have a dataframe with a column of sentences (text). I want to perform named entity replacement: I have a list whose elements hold stock information
stocks = [
{"symbol": "GS", "company": "Goldman Sachs", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "JPM", "company": "JPMorgan Chase", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "TRV", "company": "The Travelers Companies", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "V", "company": "Visa Inc.", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "AMGN", "company": "Amgen", "index": "Dow Jones", "sector": "Healthcare"},
{"symbol": "JNJ", "company": "Johnson & Johnson", "index": "Dow Jones", "sector": "Healthcare"}
]
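I want to find the 'symbol' and 'company' values in my dataframe sentences, so that each symbol is replaced with '<TKR>' and each company with '<CMPY>'. The function must be applied to all rows.
I am looking for a function that receives a dataframe with tokenized text and returns the processed text. It is important to match the whole company name, not just one element of it. As for the symbols, I know this is a bit harder, because "V" (Visa's symbol) can easily appear in ordinary text, but I am here to hear some good workarounds.
Let's start with an example: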
print(dataframe['text'])
Output:
0 [GS is the main company of Dow Jones]
1 [Once again Visa surprises all]*
2 [Johnson & Johnson's vaccine is the best one]
I want a new column with the following result:
0 [<TKR> is the main company of Dow Jones]
1 [Once again <CMPY> surprises all]*
2 [<CMPY>'s vaccine is the best one]
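Row 1 --> This one is tricky, because the company's real name is "Visa Inc.", not just "Visa"... I really don't know how to handle it.
I don't know whether it is better to work with tokenized sentences, because in that case I would also need to tokenize the "company" values such as Goldman Sachs.
You can use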
import pandas as pd
import re
stocks = [
{"symbol": "GS", "company": "Goldman Sachs", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "JPM", "company": "JPMorgan Chase", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "TRV", "company": "The Travelers Companies", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "V", "company": "Visa Inc.", "index": "Dow Jones", "sector": "Financial Services"},
{"symbol": "AMGN", "company": "Amgen", "index": "Dow Jones", "sector": "Healthcare"},
{"symbol": "JNJ", "company": "Johnson & Johnson", "index": "Dow Jones", "sector": "Healthcare"}
]
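def process_term(term):
    # Turn "A B C" into A(?:\s+B(?:\s+C)?)? so each later word is optional
    # but can only match when the word before it has matched
    t = term.split()
    first = re.escape(t[0])
    if len(t) > 1:
        first = first + r"(?:\s+{}{}".format(r"(?:\s+".join(map(re.escape, t[1:])), ")?" * (len(t)-1))
    return first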
dataframe = pd.DataFrame({'text':['GS is the main company of Dow Jones', 'Once again Visa surprises all', "Johnson & Johnson's vaccine is the best one"]})
rx_symbol = r"\b(?:{})\b".format("|".join([x["symbol"] for x in stocks]))
rx_company = r"\b(?:{})(?!\w)".format("|".join(sorted([process_term(x["company"]) for x in stocks], key=len, reverse=True)))
dataframe['new_text'] = dataframe['text'].str.replace(rx_symbol, r'<TKR>', regex=True)
dataframe['new_text'] = dataframe['new_text'].str.replace(rx_company, r'<CMPY>', regex=True)
>>> dataframe
# => text # new_text
# => 0 GS is the main company of Dow Jones <TKR> is the main company of Dow Jones
# => 1 Once again Visa surprises all Once again <CMPY> surprises all
# => 2 Johnson & Johnson's vaccine is the best one <CMPY>'s vaccine is the best one
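Since the question asks for a function that is applied to every row, the steps above could be wrapped as follows. This is only a minimal sketch: the names build_patterns and replace_entities and the text_col parameter are hypothetical and not part of the original answer, the stocks list is a shortened subset of the one above, and escaping the symbols with re.escape is an extra precaution added here.

```python
import re
import pandas as pd

def build_patterns(stocks):
    """Return (symbol_pattern, company_pattern) as regex strings."""
    def process_term(term):
        # "A B C" -> A(?:\s+B(?:\s+C)?)? : later words are optional, in order
        words = [re.escape(w) for w in term.split()]
        pattern = words[0]
        if len(words) > 1:
            pattern += r"(?:\s+" + r"(?:\s+".join(words[1:]) + ")?" * (len(words) - 1)
        return pattern

    symbols = "|".join(re.escape(s["symbol"]) for s in stocks)
    companies = "|".join(
        sorted((process_term(s["company"]) for s in stocks), key=len, reverse=True)
    )
    return r"\b(?:{})\b".format(symbols), r"\b(?:{})(?!\w)".format(companies)

def replace_entities(df, stocks, text_col="text"):
    """Return a Series with symbols replaced by <TKR> and companies by <CMPY>."""
    rx_symbol, rx_company = build_patterns(stocks)
    out = df[text_col].str.replace(rx_symbol, "<TKR>", regex=True)
    return out.str.replace(rx_company, "<CMPY>", regex=True)

# Shortened subset of the stocks list, for illustration only
stocks = [
    {"symbol": "GS", "company": "Goldman Sachs"},
    {"symbol": "V", "company": "Visa Inc."},
    {"symbol": "JNJ", "company": "Johnson & Johnson"},
]
df = pd.DataFrame({"text": [
    "GS is the main company of Dow Jones",
    "Once again Visa surprises all",
    "Johnson & Johnson's vaccine is the best one",
]})
df["new_text"] = replace_entities(df, stocks)
print(df["new_text"].tolist())
```

Building the patterns once and using the vectorized str.replace avoids recompiling a regex for every row, which is why the sketch does not use apply with a per-row function.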
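In short:
Create two regexes from the symbol and company data and run two replace operations.
The symbol regex is simple: it looks like \b(?:GS|JPM|TRV|V|AMGN|JNJ)\b and matches any of the alternatives in the group as a whole word.
The company regex follows the approach described in "Regular expression to match A, AB, ABC, but not AC. ("starts with")". It looks like \b(?:The(?:\s+Travelers(?:\s+Companies)?)?|Johnson(?:\s+\&(?:\s+Johnson)?)?|JPMorgan(?:\s+Chase)?|Goldman(?:\s+Sachs)?|Visa(?:\s+Inc\.)?|Amgen)(?!\w): each company name is passed through re.escape, and each subsequent word is matched (optionally) only if the previous word has matched. See the regex demo. Note that the right-hand word boundary is written as a (?!\w) lookahead, because \b would prevent a match when the term ends with a non-word character.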
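The effect of the (?!\w) lookahead, and one side effect of the prefix-style company pattern, can be checked with the standard re module alone. The snippet below is a small sketch using hand-written fragments of the company regex. The sample sentences are made up for illustration and do not come from the question.

```python
import re

name = r"Visa(?:\s+Inc\.)?"
text = "Visa Inc. surprises all"

# With \b on the right, the match cannot end after "Inc." because "." and the
# following space are both non-word characters, so only "Visa" is replaced.
with_b = re.sub(r"\b(?:{})\b".format(name), "<CMPY>", text)

# With (?!\w) the match may end after the dot, so the full name is replaced.
with_lookahead = re.sub(r"\b(?:{})(?!\w)".format(name), "<CMPY>", text)

print(with_b)          # <CMPY> Inc. surprises all
print(with_lookahead)  # <CMPY> surprises all

# Side effect of prefix matching: the first word alone is enough for a match,
# so a capitalized "The" is treated as "The Travelers Companies".
travelers = r"\b(?:The(?:\s+Travelers(?:\s+Companies)?)?)(?!\w)"
caveat = re.sub(travelers, "<CMPY>", "The market fell")
print(caveat)          # <CMPY> market fell
```

If matches such as the bare "The" are unwanted, one option (an assumption, not something the answer proposes) would be to make only the trailing words optional for names that start with a generic word, so that "The Travelers" is required at minimum.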