
Scrape text from a list of different URLs using Python

I have a list of different URLs from which I would like to scrape text using Python. So far I've managed to build a script that returns URLs based on a Google search with keywords; however, I would now like to scrape the content of these URLs. The problem is that I'm currently scraping the ENTIRE website, including the layout/style info, while I only want to scrape the 'visible text'. Ultimately, my goal is to scrape names from all of these URLs and store them in a pandas DataFrame. Perhaps even count how often certain names are mentioned, but that is for later. Below is a rather simple start of my code so far:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from time import sleep
from random import randint
import en_core_web_sm
import pandas as pd

url_list = [
    "https://www.nhtsa.gov/winter-driving-safety",
    "https://www.safetravelusa.com/",
    "https://www.theatlantic.com/business/archive/2014/01/how-2-inches-of-snow-created-a-traffic-nightmare-in-atlanta/283434/",
    "https://www.wsdot.com/traffic/passes/stevens/",
]

df = pd.DataFrame(url_list, columns=['url'])
df_Names = []

# load the English language model
nlp = en_core_web_sm.load()

# extract named entities (text and label) from a string
def spacy_entity(text):
    doc = nlp(text)
    return [[ent.text, ent.label_] for ent in doc.ents]

for index, row in df.iterrows():
    print(index)
    print(row['url'])
    sleep(randint(2, 5))  # polite delay between requests
    req = Request(row['url'], headers={"User-Agent": 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    # get_text() on the whole document also returns script/style contents,
    # which is exactly the problem described above
    text = BeautifulSoup(webpage, 'html5lib').get_text()
    df_Names.append(spacy_entity(text))
df["Names"] = df_Names

For getting the visible text with BeautifulSoup, there is already this answer: BeautifulSoup Grab Visible Webpage Text
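The gist of that approach can be sketched as follows. This is a minimal version, not the answer's exact code: the tags filtered out (script, style, head, etc.) are the usual non-rendered suspects and may need adjusting per site, and the built-in `html.parser` is used here to keep the example self-contained:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

def visible_text(html):
    """Return only the human-visible text of an HTML document."""
    soup = BeautifulSoup(html, 'html.parser')
    # drop elements whose text content is never rendered
    for tag in soup(['script', 'style', 'head', 'meta', 'noscript']):
        tag.decompose()
    # drop HTML comments, which get_text() would otherwise include
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    # collapse whitespace left behind by the removed elements
    return ' '.join(soup.get_text().split())

html = """<html><head><style>body {color: red;}</style></head>
<body><!-- nav --><script>var x = 1;</script>
<p>Heavy snow closed the pass.</p></body></html>"""
print(visible_text(html))
```

Swapping this result in for the bare `get_text()` call in your loop should remove the layout/style noise before the text reaches spaCy.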

Once you get your visible text, if you want to extract "names" (I'm assuming by names here you mean "nouns"), you can check the nltk package (or TextBlob) in this other answer: Extracting all Nouns from a text file using nltk

Once you apply both, you can ingest the outputs into a pandas DataFrame.
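For the "how often certain names are mentioned" part, a `Counter` over the per-URL name lists drops straight into a DataFrame. The lists and URLs below are placeholders standing in for the output of the two steps above:

```python
from collections import Counter
import pandas as pd

# hypothetical output of the two steps above: one list of extracted names per URL
names_per_url = [
    ['Atlanta', 'Georgia', 'Atlanta'],
    ['Stevens', 'Pass'],
]
urls = ['https://example.com/a', 'https://example.com/b']  # placeholder URLs

# per-URL view, matching the df["Names"] column in the question
df = pd.DataFrame({'url': urls, 'Names': names_per_url})

# overall mention counts across all URLs
counts = Counter(name for names in names_per_url for name in names)
counts_df = (pd.DataFrame(counts.items(), columns=['name', 'mentions'])
               .sort_values('mentions', ascending=False, ignore_index=True))
print(counts_df)
```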

Note: please notice that extracting the visible text from an HTML document is still an open problem. These two papers highlight the problem far better than I can, and both use Machine Learning techniques: https://arxiv.org/abs/1801.02607 , https://dl.acm.org/doi/abs/10.1145/3366424.3383547 . And their respective GitHub repositories: https://github.com/dalab/web2text , https://github.com/mrjleo/boilernet
