Scrape text from a list of different URLs using Python

I have a list of URLs that I would like to scrape text from using Python. So far I've built a script that returns URLs from a Google search on keywords, and I would now like to scrape the content of those URLs. The problem is that I'm scraping the ENTIRE page, including layout and style information, when I only want the visible text. Ultimately, my goal is to extract the names mentioned on each of these pages and store them in a pandas DataFrame, perhaps also counting how often each name is mentioned, but that is for later. Below is a rather simple start of my code so far:

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import requests
from time import sleep
from random import randint
import spacy
import en_core_web_sm
import pandas as pd

url_list = ["https://www.nhtsa.gov/winter-driving-safety", "https://www.safetravelusa.com/", "https://www.theatlantic.com/business/archive/2014/01/how-2-inches-of-snow-created-a-traffic-nightmare-in-atlanta/283434/", "https://www.wsdot.com/traffic/passes/stevens/"]

df = pd.DataFrame(url_list, columns = ['url'])
df_Names = []

# load english language model
nlp = en_core_web_sm.load()

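# find named entities in a text; returns a list of [text, label] pairs
def spacy_entity(text):
    doc = nlp(text)
    return [[ent.text, ent.label_] for ent in doc.ents]

for url in df["url"]:
    # send a browser-like User-Agent with the request
    req = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    webpage = urlopen(req).read()
    # extract the text of the parsed page
    text = BeautifulSoup(webpage, "html5lib").get_text()
    # collect the entities found on this page
    df_Names.append(spacy_entity(text))

# one list of entities per URL, in the same order as url_list
df["Names"] = df_Names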

To get the visible text with BeautifulSoup, there is already an answer: BeautifulSoup Grab Visible Webpage Text
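As a rough illustration of that approach, here is a minimal sketch that keeps only text nodes whose parent tag is normally rendered. The function name `visible_text`, the `HIDDEN_PARENTS` set, and the sample HTML are my own assumptions for illustration, and the filter is a heuristic, not a complete definition of "visible".

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Parent tags whose text is not shown to a reader (an assumed, non-exhaustive list).
HIDDEN_PARENTS = {"style", "script", "head", "title", "meta", "noscript", "[document]"}

def visible_text(html):
    """Return the text a reader would likely see in an HTML string (heuristic)."""
    soup = BeautifulSoup(html, "html.parser")
    chunks = []
    for node in soup.find_all(string=True):
        # Skip HTML comments.
        if isinstance(node, Comment):
            continue
        # Skip text that lives inside non-rendered tags.
        if node.parent.name in HIDDEN_PARENTS:
            continue
        text = node.strip()
        if text:
            chunks.append(text)
    return " ".join(chunks)

# Made-up sample page, so the sketch runs without network access.
SAMPLE_HTML = (
    "<html><head><title>T</title><style>p{color:red}</style></head>"
    "<body><!-- note --><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script></body></html>"
)
print(visible_text(SAMPLE_HTML))
```

In the question's loop, the downloaded `webpage` bytes could be passed to a function like this in place of the plain `get_text()` call.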

Once you have the visible text, if you want to extract "names" (I'm assuming that by names you mean "nouns"), you can look at the nltk package (or TextBlob), as described in this other answer: Extracting all Nouns from a text file using nltk

Once you apply both, you can load the results into a pandas DataFrame.
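One way this last step might look, as a sketch only: it assumes the extraction step yields, per URL, a list of [text, label] pairs like the one `spacy_entity` returns in the question, and it assumes "names" means entities labeled `PERSON`. The helper name `names_to_frame` and the example data are hypothetical.

```python
from collections import Counter

import pandas as pd

def names_to_frame(entities_by_url):
    """Build one row per (url, name) with the number of mentions.

    entities_by_url maps a URL to a list of [text, label] pairs.
    """
    rows = []
    for url, entities in entities_by_url.items():
        # Count only entities labeled PERSON (assumption: these are the "names").
        counts = Counter(text for text, label in entities if label == "PERSON")
        for name, n in counts.most_common():
            rows.append({"url": url, "name": name, "mentions": n})
    return pd.DataFrame(rows, columns=["url", "name", "mentions"])

# Made-up input for illustration only; real input would come from the NLP step.
example = {
    "https://example.com/a": [
        ["Jane Doe", "PERSON"],
        ["Atlanta", "GPE"],
        ["Jane Doe", "PERSON"],
    ],
    "https://example.com/b": [["John Roe", "PERSON"]],
}
print(names_to_frame(example))
```

A long format like this (one row per URL and name) keeps the mention counts easy to sort, filter, or aggregate across URLs later.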

Note: Extracting the visible text from an HTML page is still an open problem. These two papers describe the problem far better than I can, and both use machine learning techniques: https://arxiv.org/abs/1801.02607 , https://dl.acm.org/doi/abs/10.1145/3366424.3383547 . Their respective GitHub repositories are https://github.com/dalab/web2text and https://github.com/mrjleo/boilernet
