Scraping text from a list of different URLs with Python

Question

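I have a list of different URLs from which I want to scrape text using Python. So far I have built a script that returns URLs based on a Google search with keywords, but I now want to scrape the content of those URLs. The problem is that I am currently scraping the entire page, including layout/style information, whereas I only want the "visible text". Ultimately, my goal is to extract the names mentioned on all of these URLs and store them in a pandas DataFrame, possibly including how often certain names are mentioned, but that is for later. Below is a rather simple start of my code so far: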

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import requests
from time import sleep
from random import randint
import spacy
import en_core_web_sm
import pandas as pd

url_list = ["https://www.nhtsa.gov/winter-driving-safety", "https://www.safetravelusa.com/", "https://www.theatlantic.com/business/archive/2014/01/how-2-inches-of-snow-created-a-traffic-nightmare-in-atlanta/283434/", "https://www.wsdot.com/traffic/passes/stevens/"]

df = pd.DataFrame(url_list, columns = ['url'])
df_Names = []

# load english language model
nlp = en_core_web_sm.load()

# Find named entities in a text and return [text, label] pairs
def spacy_entity(text):
    doc = nlp(text)
    return [[ent.text, ent.label_] for ent in doc.ents]

for index, row in df.iterrows():
    print(index)
    print(row["url"])
    # Pause between requests to avoid hammering the servers
    sleep(randint(2, 5))
    req = Request(row["url"], headers={"User-Agent": "Mozilla/5.0"})
    webpage = urlopen(req).read()
    # get_text() returns every text node, including script and style content
    page_text = BeautifulSoup(webpage, "html5lib").get_text()
    df_Names.append(spacy_entity(page_text))
df["Names"] = df_Names

Answer 1

To get the visible text with BeautifulSoup, there is already this answer: BeautifulSoup Grab Visible Webpage Text
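
As a rough illustration of that approach, the sketch below drops text nodes that sit inside tags a browser does not render, as well as HTML comments. The helper names `is_visible` and `visible_text`, the `HIDDEN_PARENTS` set, and the sample HTML are assumptions made for this example, and the built-in `html.parser` is used instead of `html5lib` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Tags whose text content is not rendered in the page body.
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "[document]"}


def is_visible(node):
    """Return True if a text node is likely to be shown to the reader."""
    if isinstance(node, Comment):
        return False
    return node.parent.name not in HIDDEN_PARENTS


def visible_text(html):
    """Extract the visible text of an HTML document as one string."""
    soup = BeautifulSoup(html, "html.parser")
    texts = soup.find_all(string=True)
    chunks = (t.strip() for t in texts if is_visible(t))
    return " ".join(chunk for chunk in chunks if chunk)


sample = """<html><head><title>Demo</title><style>p {color: red;}</style></head>
<body><!-- hidden note --><p>Drive slowly in snow.</p>
<script>var x = 1;</script><p>Check the forecast.</p></body></html>"""
print(visible_text(sample))  # Drive slowly in snow. Check the forecast.
```

In the question's loop, `visible_text(webpage)` could take the place of the `get_text()` call. This filter is only a heuristic: elements hidden through CSS, navigation menus and footers will still come through.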

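Once you have the visible text, if you want to extract "names" (I am assuming here that by names you mean "nouns"), you can check the nltk package (or Blob) in this other answer: Extracting all Nouns from a text file using nltk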
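
A minimal sketch of the noun-filtering idea with nltk might look like the following. The function names `nouns_from_tagged` and `extract_nouns` are made up for this example. It assumes nltk's default English tagger, which emits Penn Treebank tags, and it assumes the tokenizer and tagger data have already been fetched with `nltk.download()` (the exact resource names depend on the nltk version).

```python
# Penn Treebank tags for common and proper nouns, singular and plural.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def nouns_from_tagged(tagged_tokens):
    """Keep the words whose part-of-speech tag marks them as nouns."""
    return [word for word, tag in tagged_tokens if tag in NOUN_TAGS]


def extract_nouns(text):
    """Tokenize and tag `text` with nltk, then return its nouns.

    nltk is imported here so the helper above works without it; the
    tokenizer and tagger data must be downloaded once beforehand.
    """
    import nltk

    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return nouns_from_tagged(tagged)


# Hand-tagged example, so the filtering step can be checked without nltk data.
tagged = [("Snow", "NN"), ("closed", "VBD"), ("roads", "NNS"),
          ("in", "IN"), ("Atlanta", "NNP")]
print(nouns_from_tagged(tagged))  # ['Snow', 'roads', 'Atlanta']
```

If "names" really means people, places and organizations rather than nouns in general, restricting the filter to `NNP` and `NNPS`, or keeping the spaCy entity labels from the question's code, is probably closer to the goal.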

After applying both, you can load the output into a pandas DataFrame.
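
One possible shape for that DataFrame is a long table with one row per URL and name, which also covers the "how often is a name mentioned" idea from the question. The `names_by_url` dictionary and its example.com URLs are invented placeholder data standing in for the output of the extraction step.

```python
from collections import Counter

import pandas as pd

# Hypothetical output of the extraction step: one list of names per URL.
names_by_url = {
    "https://example.com/a": ["Atlanta", "Georgia", "Atlanta"],
    "https://example.com/b": ["Seattle"],
}

# One row per (url, name) pair, with the number of mentions on that page.
rows = [
    {"url": url, "name": name, "count": count}
    for url, names in names_by_url.items()
    for name, count in Counter(names).items()
]
df = pd.DataFrame(rows, columns=["url", "name", "count"])
print(df)
```

A long table like this is easy to aggregate later, for example with `df.groupby("name")["count"].sum()` to get totals across all pages.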

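Note: extracting the visible text from a given HTML document is still an open problem. These two papers highlight the issues better than I can, and both use machine learning techniques: https://arxiv.org/abs/1801.02607 and https://dl.acm.org/355476/24.1385436. Their respective GitHub repositories are https://github.com/dalab/web2text and https://github.com/mrjleo/boilernet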

Solution 1
0 Accepted 2021-02-09 11:24:51
