How to extract Named Entities from Pandas DataFrame using SpaCy
I am trying to extract named entities using the first answer to this question, and my code is as follows:
for i in df['Article'].to_list():
    doc = nlp(i)
    for entity in doc.ents:
        print(entity.text)
But it is not printing any entities. I have tried print(i) and print(doc); both variables have values, and df['Article'] contains news text. Can someone explain why the second loop is not extracting entities? Thank you.
EDIT:
This is the dataset file; please run the following code to reproduce the preprocessing that I have done.
df.iloc[:,0].dropna(inplace=True)
df = df[df.iloc[:,0].notna()]
To remove special characters from df['Article']:
df['Article'] = df['Article'].map(lambda x: re.sub(r'\W+', '', x))
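With df['Article'].map(lambda x: re.sub(r'\W+', '', x)), you remove every non-word character from your articles, and that includes all whitespace. Each article is collapsed into a single run of letters and digits with no word boundaries left, so spaCy has nothing sensible to tokenize and doc.ents comes back empty.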
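You need to use
df['Article'] = df['Article'].str.replace(r'(?:_|[^\w\s])+', '', regex=True)
With that regex, you only remove special characters and leave whitespace intact. Note that regex=True must be passed explicitly in pandas 2.0 and later, where str.replace treats the pattern as a literal string by default.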
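To see the difference between the two patterns without loading the dataset, here is a minimal sketch that uses only the standard re module. The sample sentence is made up for illustration and is not taken from the asker's data.

```python
import re

# Hypothetical sample text standing in for one row of df['Article'].
sample = "Apple Inc. hired Tim Cook in Cupertino, California!"

# Pattern from the question: \W+ also matches whitespace,
# so every word is glued to its neighbours.
collapsed = re.sub(r'\W+', '', sample)

# Pattern from the answer: remove underscores and any character
# that is neither a word character nor whitespace.
cleaned = re.sub(r'(?:_|[^\w\s])+', '', sample)

print(collapsed)  # AppleInchiredTimCookinCupertinoCalifornia
print(cleaned)    # Apple Inc hired Tim Cook in Cupertino California
```

The second string still has its spaces, so a tokenizer can split it into words again. Keep in mind that stripping punctuation at all may change what an NER model detects, so it can be worth comparing against the untouched text as well.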