
Show NER spaCy data in a dataframe

I am doing some web scraping to extract text from an HTML page and using NER (spaCy) to identify information such as assets under management, addresses, and founding dates of companies. Once the information is extracted, I would like to place it in a dataframe.

I am working with the following script:

from bs4 import BeautifulSoup
import numpy as np
from time import sleep
from random import randint
from selenium import webdriver
import pandas as pd
import spacy
from spacy import displacy
import en_core_web_sm
import requests
import re

NER = spacy.load("en_core_web_sm")

url = "https://www.baincapital.com/"


driver = webdriver.Chrome("C:/Program Files/chromedriver.exe")
driver.get(url)  
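sleep(randint(5, 15))  # wait for the page to finish rendering
soup = BeautifulSoup(driver.page_source, 'html.parser')
body = soup.body.text

# replace newlines, tabs, carriage returns and non-breaking spaces with spaces
body = body.replace('\n', ' ')
body = body.replace('\t', ' ')
body = body.replace('\r', ' ')
body = body.replace('\xa0', ' ')

# run the NER pipeline and highlight the entities in the notebook
text3 = NER(body)
displacy.render(text3, style="ent", jupyter=True)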

The output is shown as:

[image: spaCy extraction]

And I would like to place it in the following rudimentary table:

Entity      Identified
Money       $155 Billion
Date        1984
Org         Bain Capital
Org         Bain Capital Investor Portal Please
Cardinal    four
Cardinal    24
GPE         US

Essentially, take the highlighted info and place it in a dataframe together with its identifying label.

Once you have the body as plain text, you can parse it into a spaCy document, collect each entity's label and text into a list, and then build a pandas dataframe from that list:

# ... your code here ...
body = soup.body.text

# now, this is the modification:
body = ' '.join(body.split())
doc = NER(body)
# one (label, text) pair per detected entity
entities = [(e.label_, e.text) for e in doc.ents]
df = pd.DataFrame(entities, columns=['Entity', 'Identified'])
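
To see what the resulting dataframe looks like without launching a browser or loading the model, here is a minimal sketch that uses a hard-coded list in place of doc.ents. The tuples are copied from the table in the question, written with the upper-case labels spaCy reports (MONEY, DATE, ORG, ...). The `wanted` list and the filtering step are illustrative additions, not part of the original answer:

```python
import pandas as pd

# Stand-in for [(e.label_, e.text) for e in doc.ents]. The values come from
# the question's table; a real run of the model may return different entities.
entities = [
    ("MONEY", "$155 Billion"),
    ("DATE", "1984"),
    ("ORG", "Bain Capital"),
    ("ORG", "Bain Capital Investor Portal Please"),
    ("CARDINAL", "four"),
    ("CARDINAL", "24"),
    ("GPE", "US"),
]

df = pd.DataFrame(entities, columns=["Entity", "Identified"])

# Hypothetical follow-up: keep only the labels of interest and drop
# exact duplicate rows, then renumber the index.
wanted = ["MONEY", "DATE", "GPE"]
subset = df[df["Entity"].isin(wanted)].drop_duplicates().reset_index(drop=True)

print(df)
print(subset)
```

Filtering on the Entity column this way is one option for narrowing the output down to amounts, dates, and places; which labels matter depends on what you want to extract.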

Note that the body = ' '.join(body.split()) line collapses every run of whitespace (newlines, tabs, carriage returns, non-breaking spaces) into a single space, which is simpler and shorter than the chain of replace() calls you used.
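
As a small illustration of the difference between the two whitespace approaches, the sketch below runs both on a made-up string (the sample text is invented for the example and is not taken from the scraped page):

```python
# Invented sample containing newlines, a tab, a non-breaking space and CRLF.
raw = "Bain Capital\n\n\tmanages\xa0approximately \r\n $155 billion"

# Approach from the question: replace each whitespace character with a space.
# Consecutive whitespace characters leave consecutive spaces behind.
replaced = raw
for ch in ("\n", "\t", "\r", "\xa0"):
    replaced = replaced.replace(ch, " ")

# Approach from the answer: split on any run of whitespace, then rejoin
# the pieces with exactly one space between them.
normalized = " ".join(raw.split())

print(repr(replaced))
print(repr(normalized))
```

The replace() chain leaves runs of several spaces wherever the page had several whitespace characters in a row, while split() followed by join() also trims leading and trailing whitespace.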

