Scraping a website by clicking each hyperlink in a page
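I am trying to scrape a web page that lists accommodation ads. To get the address of each place, I need to click on each listing, scrape the address section, and then go back and move on to the next listing. This process continues over several pages.
I am using Beautiful Soup for scraping and Selenium to drive the browser.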
    import urllib2
    import csv
    from bs4 import BeautifulSoup
    import pandas as pd
    import selenium
    from selenium import webdriver
    import time
    from time import sleep

    erasmusu_base = "https://erasmusu.com/en/erasmus-barcelona/student-housing?english=1&id=261&p="
    hdr= {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent' : "Magic Browser"}

    # For the time being lets scrape up to page 2
    # This for loop is for moving to the next pages
    for i in range(0,2):
        erasmusu = erasmusu_base+str(i)
        page = urllib2.Request(erasmusu, headers=hdr )
        content = urllib2.urlopen(page).read()

    browser = webdriver.Chrome()
    browser.get(erasmusu_base)
    ad = browser.find_element_by_xpath('/html/body/div[1]/div[1]/div[3]/div[3]/div[2]/div/ul/li[1]/div[2]/h3/a')
    ad.click()
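First, I am trying to click on an ad and open it in a tab so that I can get the price information. Then I will repeat this process for the remaining ads.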
You will have to use Selenium to get past the protection on the website.
    from selenium import webdriver
    from bs4 import BeautifulSoup
    from time import sleep
    import re

    erasmusu_base = "https://erasmusu.com/en/erasmus-barcelona/student-housing?english=1&id=261&p="
    # Not used by Selenium below; kept from the question.
    hdr= {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent' : "Magic Browser"}

    # Note: Selenium 4.3+ removed the find_element(s)_by_* helpers;
    # there, use browser.find_elements(By.CSS_SELECTOR, ...) instead.

    # For the time being lets scrape up to page 2
    # This for loop is for moving to the next pages
    browser = webdriver.Chrome()
    for i in range(0, 2):
        # Append the page number so each iteration loads a different page.
        browser.get(erasmusu_base + str(i))
        sleep(5)
        # Hide cookie pop up.
        if browser.find_elements_by_css_selector("a.hide_cookies_panel"):
            browser.find_element_by_css_selector("a.hide_cookies_panel").click()
            sleep(1)
        # Collect the listing links first, so there is no need to click and go back.
        hrefs = [a.get_attribute('href') for a in browser.find_elements_by_css_selector('h3 a')]
        # Visit each listing page in turn.
        for href in hrefs:
            browser.get(href)
            sleep(5)
            # Use BeautifulSoup to parse address.
            soup = BeautifulSoup(browser.page_source, 'html.parser')
            # Use regex for address text and find price by class
            print(soup.find(text=re.compile('^Address|^Dirección')).parent.text, soup.find('span', {'class':'priceflat'}).text)
    browser.quit()
Output:
Address: Carrer de la Garrotxa, 7, 08041 Barcelona, Spain 450 € / month
Address: Carrer de la Boqueria, 08002 Barcelona, España 800 € / month
Address: Carrer del Dos de Maig, Barcelona, Spain 495 € / month
Address: Carrer de Sant Ferran, 08031 Barcelona, Spain 340 € / month
Dirección: Carrer d'Arenys, 08035 Barcelona, Spain 400 € / mes
...
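The address lookup relies on a regex that matches a text node starting with either the English or the Spanish label. As a minimal sketch of how that pattern behaves on its own (Python 3 assumed; `split_address` and the `samples` list are hypothetical names used only for illustration, not part of the answer), the same expression can be tried on plain strings without a browser:

```python
import re

# Same pattern as in the answer: text that starts with the English
# or the Spanish label.
ADDRESS_LABEL = re.compile(r'^Address|^Dirección')

def split_address(text):
    """Return (label, address) for a string like 'Address: Some street'.

    Returns None when the text does not start with a known label.
    """
    if not ADDRESS_LABEL.search(text):
        return None
    label, _, address = text.partition(':')
    return label.strip(), address.strip()

# The first two strings are taken from the output above; the third is a
# made-up line that should not match.
samples = [
    "Address: Carrer de la Garrotxa, 7, 08041 Barcelona, Spain",
    "Dirección: Carrer d'Arenys, 08035 Barcelona, Spain",
    "Price: 450 € / month",
]
for s in samples:
    print(split_address(s))
```

Splitting the label from the address this way makes it easier to store the address as its own field instead of printing the raw text.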
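If you are using Python 2.7, add this as the first line:
# -*- coding: utf-8 -*-
and change:
re.compile('^Address|^Dirección')
to
re.compile(ur'^Address|^Dirección')
The output will look garbled in the terminal, but it will look fine if you write it to a file.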
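Writing the results to a file could look roughly like the sketch below. It assumes Python 3, where `open` accepts an `encoding` argument. The helper `save_rows`, the column names, and the file name `listings.csv` are hypothetical choices for illustration, and the two sample rows reuse values from the output above.

```python
import csv
import os
import tempfile

def save_rows(rows, path):
    """Write (address, price) rows to a UTF-8 encoded CSV file."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['address', 'price'])
        writer.writerows(rows)

# Sample rows standing in for the scraped (address, price) pairs.
rows = [
    ("Carrer de la Boqueria, 08002 Barcelona, España", "800 € / month"),
    ("Carrer d'Arenys, 08035 Barcelona, Spain", "400 € / mes"),
]

# Hypothetical output location in the system temp directory.
path = os.path.join(tempfile.gettempdir(), 'listings.csv')
save_rows(rows, path)
```

Because the addresses contain commas, the `csv` module quotes those fields automatically, so the file can be read back with `csv.reader` without the columns shifting.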