
Scrape text from a list of different URLs using Python

I have a list of different URLs from which I would like to scrape the text using Python. So far I've built a script that returns URLs based on a Google search with keywords, and I would now like to scrape the content of those URLs. The problem is that I'm currently scraping the ENTIRE website, including layout/style information, whereas I only want the 'visible text'. Ultimately, my goal is to scrape the names from all of these URLs and store them in a pandas DataFrame. Perhaps I'll even count how often certain names are mentioned, but that is for later. Below is a rather simple start of my code so far:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import requests
from time import sleep
from random import randint
import spacy
import en_core_web_sm
import pandas as pd

url_list = ["https://www.nhtsa.gov/winter-driving-safety", "https://www.safetravelusa.com/", "https://www.theatlantic.com/business/archive/2014/01/how-2-inches-of-snow-created-a-traffic-nightmare-in-atlanta/283434/", "https://www.wsdot.com/traffic/passes/stevens/"]

df = pd.DataFrame(url_list, columns = ['url'])
df_Names = []

# load english language model
nlp = en_core_web_sm.load()

# find named entities ("Names") in a piece of text
def spacy_entity(text):
    doc = nlp(text)
    return [[w.text, w.label_] for w in doc.ents]

for index, row in df.iterrows():
    print(index)
    print(row["url"])
    # pause between requests to avoid hammering the servers
    sleep(randint(2, 5))
    req = Request(row["url"], headers={"User-Agent": "Mozilla/5.0"})
    webpage = urlopen(req).read()
    # get_text() here still returns script/style content as well
    text = BeautifulSoup(webpage, 'html5lib').get_text()
    df_Names.append(spacy_entity(text))
df["Names"] = df_Names

For getting the visible text with BeautifulSoup, there is already this answer: BeautifulSoup Grab Visible Webpage Text
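In case that link goes stale, here is a minimal sketch of the same idea (the helper names tag_visible and visible_text, and the exact tag list, are just illustrative): keep only the text nodes that do not sit inside script/style/head-type tags and are not HTML comments, then join what is left:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
from bs4.element import Comment

def tag_visible(element):
    # drop text that lives inside non-visible containers
    if element.parent.name in ('style', 'script', 'head', 'title', 'meta', '[document]'):
        return False
    # drop HTML comments
    if isinstance(element, Comment):
        return False
    return True

def visible_text(html):
    soup = BeautifulSoup(html, 'html5lib')
    texts = soup.find_all(string=True)      # every text node in the document
    return " ".join(t.strip() for t in filter(tag_visible, texts) if t.strip())

req = Request("https://www.nhtsa.gov/winter-driving-safety",
              headers={"User-Agent": "Mozilla/5.0"})
print(visible_text(urlopen(req).read())[:500])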

Once you get your visible text, if you want to extract "names" (I'm assuming by names here you mean "nouns"), you can check the nltk package (or TextBlob) in this other answer: Extracting all Nouns from a text file using nltk
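A rough sketch of the nltk route (it assumes the punkt tokenizer and the averaged_perceptron_tagger models have been downloaded; extract_nouns is just an illustrative helper): tokenize, POS-tag, and keep the tokens whose Penn Treebank tag starts with 'NN':

import nltk
# one-time downloads:
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

def extract_nouns(text):
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)
    # NN, NNS, NNP, NNPS are the Penn Treebank noun tags
    return [word for word, tag in tagged if tag.startswith('NN')]

print(extract_nouns("Two inches of snow created a traffic nightmare in Atlanta."))
# e.g. ['inches', 'snow', 'traffic', 'nightmare', 'Atlanta']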

Once you apply both, you can ingest the outputs into a pandas DataFrame.
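For example, a minimal sketch (names_per_url is a hypothetical result of the two steps above, one list of names per URL), which also counts how often each name occurs, since the question mentions that as a later goal:

import pandas as pd
from collections import Counter

url_list = ["https://www.nhtsa.gov/winter-driving-safety",
            "https://www.safetravelusa.com/"]
names_per_url = [["NHTSA", "Atlanta", "NHTSA"], ["Stevens Pass"]]   # hypothetical output

df = pd.DataFrame({"url": url_list, "Names": names_per_url})
df["Name_counts"] = df["Names"].apply(Counter)                      # per-URL frequencies
print(df)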

Note: extracting the visible text from HTML is still an open problem. These two papers highlight the problem far better than I can, and both use machine learning techniques: https://arxiv.org/abs/1801.02607 and https://dl.acm.org/doi/abs/10.1145/3366424.3383547, with their respective GitHub repositories at https://github.com/dalab/web2text and https://github.com/mrjleo/boilernet
