
Extract entities from spacy

I have a Python file for web scraping, scrapper.py:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://en.wikipedia.org/wiki/Willis').text
soup = BeautifulSoup(source, 'lxml')

def my_function():
    heading = soup.find('h1', {'id': 'firstHeading'}).text
    print(heading)
    print()
    for item in soup.select("#mw-content-text"):
        required_data = [p_item.text for p_item in item.select("p")][1:3]
        print('\n'.join(required_data).encode('utf-8'))

    Willis = soup.find("caption", {"class": "fn org"}).text
    print(Willis)
    print()

I want to use spacy to extract entities from the text scraped by scrapper.py. This is pyspacy.py:

import spacy
import scrapper

entity_list = []

nlp = spacy.load("en_core_web_sm")


doc = nlp(scrapper.my_function())

for entity in doc.ents:
    entity_list.append((entity.text, entity.label_))
print(entity_list)

It just prints the scraped data in the terminal, followed by this error:

Traceback (most recent call last):
  File "hakuna_spacy.py", line 12, in <module>
    doc = nlp(printwo.pubb())
  File "C:\Users\Hp\AppData\Local\Programs\Python\Python37\lib\site-packages\spacy\language.py", line 423, in __call__
    if len(text) > self.max_length:
TypeError: object of type 'NoneType' has no len()

What am I doing wrong? Can someone please explain?

In your initial code snippet, the problem was that pubb prints text to stdout but does not return a value. You could try this instead:

def pubb():
    return 'hello, world'
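To make the difference concrete, here is a minimal sketch. The names prints_only and returns_text are hypothetical and are not from the question. It shows that a function which only prints evaluates to None, and that calling len() on that None gives the same TypeError as in the traceback.

```python
def prints_only():
    # Writes to stdout, but has no return statement, so the call evaluates to None.
    print("hello, world")


def returns_text():
    # Hands the string back to the caller instead of printing it.
    return "hello, world"


printed = prints_only()    # None
returned = returns_text()  # a str that can be passed on to nlp(...)

try:
    len(printed)  # the same kind of check that fails inside spaCy's __call__
except TypeError as exc:
    error_message = str(exc)
```

Only the value stored in returned would be usable as input to nlp(...).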

[Edit]:

In the edited version, I can see some other issues.

The fetch works:

>>> source = requests.get('https://en.wikipedia.org/wiki/Willis').text
>>> len(source)
36836

bs4 correctly finds the heading too:

>>> soup = BeautifulSoup(source,'lxml')
>>> soup.find('h1',{'id':'firstHeading'}).text
'Willis'

bs4 also finds an item in the content section (just 1):

>>> len(soup.select("#mw-content-text"))
1

The trouble is that it then finds no paragraphs in the [1:3] slice:

>>> soup.select("#mw-content-text")[0].select("p")[1:3]
[]

And it doesn't find the caption (the call returns None):

>>> soup.find("caption",{"class":"fn org"})
>>>

You also still have the original issue: my_function does not return any text, so the call that passes its return value to spacy receives None, which raises the exception. What do you want my_function to return?
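One possible shape for such a function is sketched below. It assumes you want the heading plus the selected paragraphs as a single string. The name extract_text and the SAMPLE_HTML snippet are hypothetical stand-ins, used so the sketch runs without a network request, and html.parser is used in place of lxml only to avoid the extra dependency. Each lookup is checked for None before .text is read, because find returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
SAMPLE_HTML = """
<html><body>
<h1 id="firstHeading">Willis</h1>
<div id="mw-content-text">
  <p>Willis may refer to:</p>
  <p>Bruce Willis is an American actor.</p>
  <p>Willis Tower is a skyscraper in Chicago.</p>
</div>
</body></html>
"""


def extract_text(html):
    """Return heading, paragraphs and caption as one string (never None)."""
    soup = BeautifulSoup(html, "html.parser")
    parts = []

    heading = soup.find("h1", {"id": "firstHeading"})
    if heading is not None:
        parts.append(heading.get_text(strip=True))

    for item in soup.select("#mw-content-text"):
        # Same [1:3] slice as in the question: second and third paragraphs.
        parts.extend(p.get_text(strip=True) for p in item.select("p")[1:3])

    caption = soup.find("caption", {"class": "fn org"})
    if caption is not None:  # find() returns None when nothing matches
        parts.append(caption.get_text(strip=True))

    return "\n".join(parts)


text = extract_text(SAMPLE_HTML)
```

The returned string can then be passed to nlp(text). If nothing on the page matches, the function returns an empty string, which yields a document with no entities and avoids the TypeError.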
