Extract entities from spacy
I have a Python file for web scraping, scrapper.py:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://en.wikipedia.org/wiki/Willis').text
soup = BeautifulSoup(source,'lxml')
def my_function():
    heading = soup.find('h1',{'id':'firstHeading'}).text
    print(heading)
    print()
    for item in soup.select("#mw-content-text"):
        required_data = [p_item.text for p_item in item.select("p")][1:3]
        print('\n'.join(required_data).encode('utf-8'))
    Willis= soup.find("caption",{"class":"fn org"}).text
    print(Willis)
    print()
I want to use spaCy to extract entities from the text scraped by scrapper.py. This is pyspacy.py:
import spacy
import scrapper
entity_list = []
nlp = spacy.load("en_core_web_sm")
doc = nlp(scrapper.my_function())
for entity in doc.ents:
    entity_list.append((entity.text, entity.label_))
print(entity_list)
It just prints the scraped data in the terminal, followed by this error:
Traceback (most recent call last):
  File "hakuna_spacy.py", line 12, in <module>
    doc = nlp(printwo.pubb())
  File "C:\Users\Hp\AppData\Local\Programs\Python\Python37\lib\site-packages\spacy\language.py", line 423, in __call__
    if len(text) > self.max_length:
TypeError: object of type 'NoneType' has no len()
What am I doing wrong? Can someone please explain?
In your initial code snippet, the problem was that pubb prints text to stdout but does not return a value. You could try this instead:
def pubb():
    return 'hello, world'
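To make the difference concrete, here is a minimal sketch comparing a function that only prints with one that returns. The names pubb_prints and pubb_returns are made up for illustration; the len() call mirrors the check shown in the traceback.

```python
def pubb_prints():
    # Writes to stdout; with no return statement the call evaluates to None.
    print('hello, world')


def pubb_returns():
    # Hands the string back to the caller, so it can be passed on.
    return 'hello, world'


printed = pubb_prints()    # prints the text, but printed is None
returned = pubb_returns()  # returned is the string itself

try:
    len(printed)           # the same kind of check that fails in the traceback
except TypeError as exc:
    error_message = str(exc)
```

Only the returning version gives the caller something it can hand to nlp(...).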
[Edit]:
In the edited version, there are some other issues I can see.
The fetch works:
>>> source = requests.get('https://en.wikipedia.org/wiki/Willis').text
>>> len(source)
36836
bs4 correctly finds the heading too:
>>> soup = BeautifulSoup(source,'lxml')
>>> soup.find('h1',{'id':'firstHeading'}).text
'Willis'
bs4 also finds an item in the content section (just 1):
>>> len(soup.select("#mw-content-text"))
1
The trouble then is that it doesn't find any paragraph content in the [1:3] slice:
>>> soup.select("#mw-content-text")[0].select("p")[1:3]
[]
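This fails quietly because slicing never raises for out-of-range bounds. The sketch below uses plain lists with invented strings in place of the real paragraph texts, and assumes (without having checked the page) that the section holds fewer than two paragraphs:

```python
# Stand-ins for the paragraph texts that item.select("p") might yield;
# the strings are invented for illustration.
no_paragraphs = []
one_paragraph = ["only paragraph"]
three_paragraphs = ["intro", "second", "third"]

# Out-of-range slice bounds are clamped rather than raising IndexError.
empty_a = no_paragraphs[1:3]
empty_b = one_paragraph[1:3]
picked = three_paragraphs[1:3]
```

With zero or one paragraph the slice is an empty list, so there is nothing to join or print.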
And it doesn't find the caption (find returns None):
>>> soup.find("caption",{"class":"fn org"})
>>>
You also have the pre-existing issue that you are not returning any text from my_function, so the code that passes the return value of that function into the spacy call ends up passing None, which gives you the exception. What do you want my_function to return?
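As a rough sketch of one possible answer, the function below returns the collected text as a single string instead of printing it, and skips elements that bs4 does not find. It runs against an inline HTML sample rather than the live page. The name extract_text, the sample markup, and the use of the built-in html.parser instead of lxml are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Invented sample markup standing in for the downloaded page.
SAMPLE_HTML = """
<html><body>
<h1 id="firstHeading">Willis</h1>
<div id="mw-content-text">
  <p>First paragraph.</p>
  <p>Second paragraph.</p>
</div>
</body></html>
"""


def extract_text(html):
    """Return heading, paragraphs and caption as one string (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []

    # find() returns None when nothing matches, so check before reading .text.
    heading = soup.find("h1", {"id": "firstHeading"})
    if heading is not None:
        parts.append(heading.text.strip())

    for item in soup.select("#mw-content-text"):
        parts.extend(p.text.strip() for p in item.select("p"))

    caption = soup.find("caption", {"class": "fn org"})
    if caption is not None:
        parts.append(caption.text.strip())

    # Returning the string (rather than printing it) lets a caller pass it on.
    return "\n".join(parts)


text = extract_text(SAMPLE_HTML)
```

In pyspacy.py, a string built this way could then be passed to nlp(...), provided it is the text you want spaCy to analyse.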