
How to scrape plaintext from multiple links off of one website?

import re
import urllib.request

from bs4 import BeautifulSoup

sauce = urllib.request.urlopen("https://www.imdb.com/list/ls003073623/").read()
soup = BeautifulSoup(sauce, 'html.parser')
links = soup.find_all('a', href=re.compile('^/title/'))

I am trying to scrape multiple links off of a website (about 500) and I don't want to manually input each and every URL, how do I go about scraping this?

With BeautifulSoup

If I understand correctly, you are trying to collect a specific part of every link on a given page. There is an example in BeautifulSoup's documentation that shows exactly how to do that:

import re
import urllib.request

from bs4 import BeautifulSoup

html_page = urllib.request.urlopen("https://www.imdb.com/list/ls003073623/")
soup = BeautifulSoup(html_page, 'html.parser')
ids = []

for link in soup.find_all('a', attrs={'href': re.compile("^http://")}):
    # An absolute URL like http://www.imdb.com/title/tt0111161/ splits into
    # ['http:', '', 'www.imdb.com', 'title', 'tt0111161', ...], so index 4
    # is the path segment after the domain.
    ids.append(link.get('href').split("/")[4])

print(ids)
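Note that IMDb list pages typically use relative hrefs such as `/title/tt0111161/` (the question's own `^/title/` regex reflects this), so a regex anchored on `^http://` may match nothing. A minimal sketch of that variant, run here against a small inline HTML sample (hypothetical data) instead of the live page:

```python
import re
from bs4 import BeautifulSoup

# Inline sample standing in for the fetched IMDb page (hypothetical data).
html = """
<a href="/title/tt0111161/?ref_=ttls_li_tt">The Shawshank Redemption</a>
<a href="/title/tt0068646/?ref_=ttls_li_tt">The Godfather</a>
<a href="/title/tt0111161/">The Shawshank Redemption (duplicate)</a>
<a href="/name/nm0000209/">Not a title link</a>
"""

soup = BeautifulSoup(html, "html.parser")
ids = []
for link in soup.find_all("a", href=re.compile(r"^/title/")):
    # A relative href like /title/tt0111161/ splits into
    # ['', 'title', 'tt0111161', ...], so the ID sits at index 2.
    title_id = link["href"].split("/")[2]
    if title_id not in ids:  # drop duplicates while preserving order
        ids.append(title_id)

print(ids)  # ['tt0111161', 'tt0068646']
```

The same dedup-while-iterating pattern applies to the live page, where each title usually appears in several links.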

With Selenium

For reference, and since it doesn't seem like the question is limited to only BeautifulSoup, here's how we would do the same using Selenium, a very popular alternative.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.imdb.com/list/ls003073623/")

ids = []
# find_elements_by_xpath was removed in Selenium 4; use find_elements(By.XPATH, ...)
elems = driver.find_elements(By.XPATH, "//a[@href]")

for elem in elems:
    # Selenium returns absolute URLs from get_attribute("href"), so index 4 is
    # the path segment after the domain (e.g. the ID in
    # https://www.imdb.com/title/tt0111161/).
    ids.append(elem.get_attribute("href").split("/")[4])

print(ids)

driver.quit()
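Either way, once the IDs are collected they can be turned back into full title URLs for the actual scraping pass, e.g. with the standard-library `urllib.parse.urljoin` (a small sketch; the two example IDs are placeholders):

```python
from urllib.parse import urljoin

base = "https://www.imdb.com"
ids = ["tt0111161", "tt0068646"]  # example IDs collected as above

# Join each ID back onto IMDb's standard /title/<id>/ path.
urls = [urljoin(base, f"/title/{title_id}/") for title_id in ids]
print(urls)
# ['https://www.imdb.com/title/tt0111161/', 'https://www.imdb.com/title/tt0068646/']
```

Each resulting URL can then be fetched and parsed individually, which avoids typing the roughly 500 links by hand.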
