Website Scraping Specific Forms

Question

For an extracurricular school project, I'm learning how to scrape a website. As the code below shows, I am able to scrape the form rows (divs with the class 'elqFormRow') off of one page.

How would one go about scraping all occurrences of 'elqFormRow' across the whole website? I'd like to collect the URLs of the pages where that form appears into a list, but I'm running into trouble because I don't know how to go about it.

import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage').read()

soup = bs.BeautifulSoup(sauce, 'lxml')

for div in soup.find_all('div', class_='elqFormRow'):
    print(div.text.strip())

Answer 1

You can grab the URLs from a page and follow them to (presumably) scrape the whole site. Something like this, which will require a little massaging depending on where you want to start and what pages you want:

import bs4 as bs
import requests

domain = "engage.hpe.com"
initial_url = 'http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage'

# get urls to scrape
text = requests.get(initial_url).text
initial_soup = bs.BeautifulSoup(text, 'lxml')
tags = initial_soup.find_all('a', href=True)

urls = []
for tag in tags:
    # test the link target (the href string), not the tag object itself
    if domain in tag['href']:
        urls.append(tag['href'])
urls.append(initial_url)

print(urls)

# function to grab your info
def scrape_desired_info(url):
    out = []
    text = requests.get(url).text
    soup = bs.BeautifulSoup(text, 'lxml')
    for div in soup.find_all('div', class_='elqFormRow'):
        out.append(div.text.strip())
    # return only after every matching div has been collected
    return out



info = [scrape_desired_info(url) for url in urls if domain in url]

I'd use requests rather than urllib; it is much nicer to work with. If you need to go multiple levels down into the site, put the URL-finding section in a function and call it X times, where X is the number of levels of links you want to traverse.
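As a rough sketch of that multi-level idea (not a definitive implementation), the link-finding step can be wrapped in a function and repeated up to a fixed depth while recording which pages contain the form. The names `fetch_html`, `find_links`, `crawl`, and `max_depth` are made up for illustration, and the built-in 'html.parser' is used here only so the sketch does not depend on lxml:

```python
from urllib.parse import urljoin, urlparse

import bs4 as bs
import requests


def fetch_html(url):
    """Download one page and return its HTML (hypothetical helper)."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def find_links(html, page_url, domain):
    """Return the absolute URLs on this page whose host is exactly `domain`."""
    soup = bs.BeautifulSoup(html, 'html.parser')
    links = set()
    for tag in soup.find_all('a', href=True):
        # resolve relative hrefs against the page and drop any #fragment
        absolute = urljoin(page_url, tag['href']).split('#', 1)[0]
        if urlparse(absolute).netloc == domain:
            links.add(absolute)
    return links


def crawl(start_url, domain, max_depth, fetch=fetch_html):
    """Visit pages up to `max_depth` links away from `start_url`.

    Returns the URLs of pages that contain a div with class 'elqFormRow'.
    """
    seen = {start_url}
    frontier = [start_url]
    pages_with_form = []
    for depth in range(max_depth + 1):
        next_frontier = []
        for url in frontier:
            try:
                html = fetch(url)
            except requests.RequestException:
                continue  # skip pages that fail to load
            soup = bs.BeautifulSoup(html, 'html.parser')
            if soup.find('div', class_='elqFormRow') is not None:
                pages_with_form.append(url)
            if depth < max_depth:
                # only collect links if another level will be visited
                for link in sorted(find_links(html, url, domain)):
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
        frontier = next_frontier
    return pages_with_form
```

Calling `crawl(initial_url, domain, 2)` would then return a list of URLs at most two links away from the starting page where the form was found. Each URL is fetched at most once because of the `seen` set.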

Scrape responsibly. Try not to get into a sorcerer's apprentice situation where you're hitting the site over and over in a loop, or following links external to the site. In general, I'd also avoid naming the page you want to scrape in the question.
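To make those two cautions concrete, here is a small standard-library sketch. It assumes an exact hostname comparison is what you want (a plain substring test like `domain in url` would also match an external URL that merely mentions the domain in its query string). The helper names and the one-second default delay are illustrative choices, not anything the site requires:

```python
import time
from urllib.parse import urljoin, urlparse


def resolve_link(page_url, href):
    """Turn an href (possibly relative) into an absolute URL without its #fragment."""
    return urljoin(page_url, href).split('#', 1)[0]


def is_same_site(url, domain):
    """Compare the hostname exactly instead of searching the whole URL string."""
    return urlparse(url).hostname == domain


def polite_urls(urls, domain, delay_seconds=1.0):
    """Yield each distinct same-site URL once, pausing between consecutive URLs."""
    seen = set()
    for url in urls:
        if url in seen or not is_same_site(url, domain):
            continue  # skip repeats and external links
        seen.add(url)
        if len(seen) > 1:
            time.sleep(delay_seconds)  # wait before every URL after the first
        yield url
```

A loop such as `for url in polite_urls(urls, domain): scrape_desired_info(url)` would then request each page only once and never leave the site.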

1 answer

solution1
0 ACCEPTED 2016-11-21 06:25:23