Extracting entities with spaCy

Question

I have a Python file for web scraping, scrapper.py:

from bs4 import BeautifulSoup
import requests
source = requests.get('https://en.wikipedia.org/wiki/Willis').text
soup = BeautifulSoup(source,'lxml')

def my_function():

    heading = soup.find('h1',{'id':'firstHeading'}).text
    print(heading)
    print()
    for item in soup.select("#mw-content-text"):
        required_data = [p_item.text for p_item in item.select("p")][1:3]
        print('\n'.join(required_data).encode('utf-8'))

    Willis= soup.find("caption",{"class":"fn org"}).text
    print(Willis)
    print()

I want to use spaCy to extract entities from the text scraped by scrapper.py. This is pyspacy.py:

import spacy
import scrapper

entity_list = []

nlp = spacy.load("en_core_web_sm")


doc = nlp(scrapper.my_function())

for entity in doc.ents:
    entity_list.append((entity.text, entity.label_))
print(entity_list)

It just prints the scraped data in the terminal, followed by this error:

Traceback (most recent call last):
  File "hakuna_spacy.py", line 12, in <module>
    doc = nlp(printwo.pubb())
  File "C:\Users\Hp\AppData\Local\Programs\Python\Python37\lib\site-packages\spacy\language.py", line 423, in __call__
    if len(text) > self.max_length:
TypeError: object of type 'NoneType' has no len()

What am I doing wrong? Can someone please explain?

Answer 1

In your initial code snippet, the problem was that pubb prints text to stdout but does not return a value. Try this instead:

def pubb():
    return 'hello, world'
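To make the difference concrete, here is a small sketch. The function names prints_only and returns_text are made up for illustration; the point is only that a function which prints and never returns hands None back to its caller, and len(None) is what produces the TypeError in the traceback.

```python
def prints_only():
    print("hello, world")  # written to stdout; the function then returns None


def returns_text():
    return "hello, world"  # handed back to the caller


printed = prints_only()
returned = returns_text()

print(printed is None)  # True
print(len(returned))    # 12

try:
    len(printed)  # same failure spaCy hits when it is given None
except TypeError as exc:
    print(exc)  # object of type 'NoneType' has no len()
```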

[Edit]:

In the edited version, I can see some other issues.

The fetch works:

>>> source = requests.get('https://en.wikipedia.org/wiki/Willis').text
>>> len(source)
36836

bs4 correctly finds the heading too:

>>> soup = BeautifulSoup(source,'lxml')
>>> soup.find('h1',{'id':'firstHeading'}).text
'Willis'

bs4 also finds an item in the content section (just 1):

>>> len(soup.select("#mw-content-text"))
1

The trouble is that it then finds no paragraph content in the [1:3] slice:

>>> soup.select("#mw-content-text")[0].select("p")[1:3]
[]
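Note that the empty result is not an error: slicing past the end of a list simply returns []. As a hedged sketch, a helper like the one below (first_nonempty is a hypothetical name, and the sample list is assumed for illustration, not taken from the page) picks the first few non-blank paragraph strings instead of relying on fixed positions.

```python
def first_nonempty(texts, count=2):
    """Return up to `count` stripped, non-blank strings from `texts`."""
    cleaned = [t.strip() for t in texts if t and t.strip()]
    return cleaned[:count]


# Sample data, assumed for illustration: a page with a single paragraph.
paragraphs = ["Willis may refer to:\n"]

print(paragraphs[1:3])             # [] (slice past the end, no exception)
print(first_nonempty(paragraphs))  # ['Willis may refer to:']
```

With bs4 you could feed it the strings you already build, for example first_nonempty(p.text for p in item.select("p")).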

And it doesn't find the caption:

>>> soup.find("caption",{"class":"fn org"})
>>>
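When find has no match it returns None, so calling .text on the result, as the function in the question does, would raise an AttributeError. A minimal sketch of a guard follows; text_or_default is a hypothetical helper name, and it only assumes the object has a .text attribute, as bs4 tags do.

```python
def text_or_default(tag, default=""):
    """Return tag.text, or `default` when the lookup found nothing (None)."""
    return default if tag is None else tag.text


caption = None  # stands in for a find(...) call that matched nothing
print(repr(text_or_default(caption)))  # ''
```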

You also still have the original problem: my_function does not return any text, so the spaCy call that receives its return value is passed None, which raises the exception. What do you want my_function to return?
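One possible way to wire this together, as a sketch rather than a definitive fix: have the scraping function build and return a single string, and guard the spaCy call against None or empty input. The names join_parts and extract_entities are hypothetical, and nlp is assumed to be whatever spacy.load("en_core_web_sm") returned; it is passed in as a parameter so the sketch does not load a model itself.

```python
def join_parts(*parts):
    """Join the non-blank text fragments into one string for the caller."""
    return "\n".join(p.strip() for p in parts if p and p.strip())


def extract_entities(nlp, text):
    """Return (text, label) pairs, or an empty list when there is no text."""
    if not text:  # covers both None and ""
        return []
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]
```

my_function could then end with return join_parts(heading, *required_data) instead of the print calls, and pyspacy.py would call extract_entities(nlp, scrapper.my_function()).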

Answer 1
1 2020-01-14 06:20:04
