
How to scrape plaintext from multiple links off of one website?

import re
import urllib.request

from bs4 import BeautifulSoup

sauce = urllib.request.urlopen("https://www.imdb.com/list/ls003073623/").read()
soup = BeautifulSoup(sauce, 'html.parser')
links = soup.find_all('a', href=re.compile('^/title/'))

I am trying to scrape multiple links off of a website (about 500) and I don't want to manually input each and every URL, how do I go about scraping this?

With BeautifulSoup

If I understand correctly, you are trying to collect a specific part of every link on a given page. There is an example in BeautifulSoup's documentation that shows exactly how to do that:

import re
import urllib.request

from bs4 import BeautifulSoup

html_page = urllib.request.urlopen("https://www.imdb.com/list/ls003073623/")
soup = BeautifulSoup(html_page, 'html.parser')
ids = []

for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    # An absolute URL like http://www.imdb.com/title/tt0111161/ splits into
    # ['http:', '', 'www.imdb.com', 'title', 'tt0111161', ...], so index 4
    # is the path segment after the domain.
    ids.append(link.get('href').split("/")[4])

print(ids)
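Note that IMDb list pages typically use relative hrefs such as `/title/tt0111161/` (the question's own `^/title/` regex reflects this), so a regex anchored on `^http://` may match nothing. A minimal sketch of that variant, run here against a small inline HTML sample (hypothetical data) instead of the live page:

```python
import re
from bs4 import BeautifulSoup

# Inline sample standing in for the fetched IMDb page (hypothetical data).
html = """
<a href="/title/tt0111161/?ref_=ttls_li_tt">The Shawshank Redemption</a>
<a href="/title/tt0068646/?ref_=ttls_li_tt">The Godfather</a>
<a href="/title/tt0111161/">The Shawshank Redemption (duplicate)</a>
<a href="/name/nm0000209/">Not a title link</a>
"""

soup = BeautifulSoup(html, "html.parser")
ids = []
for link in soup.find_all("a", href=re.compile(r"^/title/")):
    # A relative href like /title/tt0111161/ splits into
    # ['', 'title', 'tt0111161', ...], so the ID sits at index 2.
    title_id = link["href"].split("/")[2]
    if title_id not in ids:  # drop duplicates while preserving order
        ids.append(title_id)

print(ids)  # ['tt0111161', 'tt0068646']
```

The same dedup-while-iterating pattern applies to the live page, where each title usually appears in several links.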

With Selenium

For reference, and since it doesn't seem like the question is limited to only BeautifulSoup, here's how we would do the same using Selenium, a very popular alternative.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.imdb.com/list/ls003073623/")

ids = []
# find_elements_by_xpath was removed in Selenium 4; use find_elements(By.XPATH, ...)
elems = driver.find_elements(By.XPATH, "//a[@href]")

for elem in elems:
    # Selenium returns absolute URLs from get_attribute("href"), so index 4 is
    # the path segment after the domain (e.g. the ID in
    # https://www.imdb.com/title/tt0111161/).
    ids.append(elem.get_attribute("href").split("/")[4])

print(ids)

driver.quit()
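Either way, once the IDs are collected they can be turned back into full title URLs for the actual scraping pass, e.g. with the standard-library `urllib.parse.urljoin` (a small sketch; the two example IDs are placeholders):

```python
from urllib.parse import urljoin

base = "https://www.imdb.com"
ids = ["tt0111161", "tt0068646"]  # example IDs collected as above

# Join each ID back onto IMDb's standard /title/<id>/ path.
urls = [urljoin(base, f"/title/{title_id}/") for title_id in ids]
print(urls)
# ['https://www.imdb.com/title/tt0111161/', 'https://www.imdb.com/title/tt0068646/']
```

Each resulting URL can then be fetched and parsed individually, which avoids typing the roughly 500 links by hand.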
