简体   繁体   中英

Website Scraping Specific Forms

For an extra curricular school project, I'm learning how to scrape a website. As you can see by the code below, I am able to scrape a form called, 'elqFormRow' off of one page.

How would one go about scraping all occurrences of the 'elqFormRow' on the whole website ? I'd like to return the URL of where that form was located into a list, but am running into trouble while doing so because I don't know how lol.

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage').read()

soup = bs.BeautifulSoup(sauce, 'lxml')

for div in soup.find_all('div', class_='elqFormRow'):
    print(div.text.strip())

You can grab the URLs from a page and follow them to (presumably) scrape the whole site. Something like this, which will require a little massaging depending on where you want to start and what pages you want:

import bs4 as bs
import requests

domain = "engage.hpe.com"
initial_url = 'http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage'

# get urls to scrape
text = requests.get(initial_url).text
initial_soup = bs.BeautifulSoup(text, 'lxml')
tags = initial_soup.findAll('a', href=True)

urls = []
for tag in tags:
    if domain in tag:
        urls.append(tag['href'])
urls.append(initial_url)

print(urls)

# function to grab your info
def scrape_desired_info(url):
    out = []
    text = requests.get(url).text
    soup = bs.BeautifulSoup(text, 'lxml')
    for div in soup.find_all('div', class_='elqFormRow'):
        out.append(div.text.strip())
        return out



info = [scrape_desired_info(url) for url in urls if domain in url]

URLlib stinks, use requests. If you need to go multiple levels down in the site put the URL finding section in a function and call it X number of times, where X is the number of levels of links you want to traverse.

Scrape responsibly. Try not to get into a sorcerer's apprentice situation where you're hitting the site over and over in a loop, or following links external to the site. In general, I'd also not put in the question the page you want to scrape.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM