简体   繁体   中英

Scraping a website by clicking each hyperlinks in a page

I am trying to web scrape a page where there are ads of places to stay. In order for me to get the address of this places I need to click each one of them and just scrape an address part then come back do this to the next one. This process goes for several pages.

I am using beautiful-soup for scraping and selenium for browser access.

import urllib2
import csv
from bs4 import BeautifulSoup
import pandas as pd

import selenium
from selenium import webdriver
import time
from time import sleep

erasmusu_base = "https://erasmusu.com/en/erasmus-barcelona/student-housing?english=1&id=261&p="
hdr= {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent' : "Magic Browser"}

# For the time being lets scrape up to page 2
# This for loop is for moving to the next pages
for i in range(0,2):
    erasmusu = erasmusu_base+str(i)
    page = urllib2.Request(erasmusu, headers=hdr )
    content = urllib2.urlopen(page).read()

    browser = webdriver.Chrome()
    browser.get(erasmusu_base)

    ad = browser.find_element_by_xpath('/html/body/div[1]/div[1]/div[3]/div[3]/div[2]/div/ul/li[1]/div[2]/h3/a')
    ad.click()

First, I am trying to click the ad and open a tab so that I can get the price information. Then I will continue this process for the rest of the ads.

You will have to use Selenium to get round the protection on the website.

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
import re

erasmusu_base = "https://erasmusu.com/en/erasmus-barcelona/student-housing?english=1&id=261&p="
hdr= {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent' : "Magic Browser"}

# For the time being lets scrape up to page 2
# This for loop is for moving to the next pages
browser = webdriver.Chrome()
for i in range(0, 2):
    browser.get(erasmusu_base)
    sleep(5)
    # Hide cookie pop up.
    if browser.find_elements_by_css_selector("a.hide_cookies_panel"):
        browser.find_element_by_css_selector("a.hide_cookies_panel").click()
    sleep(1)
    # Get a list of links to visit.
    hrefs = [a.get_attribute('href') for a in browser.find_elements_by_css_selector('h3 a')]
    # For each link get link.
    for href in hrefs:
        browser.get(href)
        sleep(5)
        # Use BeautifulSoup to parse address.
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        # Use regex for address text and find price by class
        print(soup.find(text=re.compile('^Address|^Dirección')).parent.text, soup.find('span', {'class':'priceflat'}).text)
browser.quit()

Outputs:

Address: Carrer de la Garrotxa, 7, 08041 Barcelona, Spain 450 € / month
Address: Carrer de la Boqueria, 08002 Barcelona, España 800 € / month
Address: Carrer del Dos de Maig, Barcelona, Spain 495 € / month
Address: Carrer de Sant Ferran, 08031 Barcelona, Spain 340 € / month
Dirección: Carrer d'Arenys, 08035 Barcelona, Spain 400 € / mes
...

If you are on Python 2.7 add as the first line:

# -*- coding: utf-8 -*-

and change:

re.compile('^Address|^Dirección')

to

re.compile(ur'^Address|^Dirección')

It will look horrible in a terminal but if you write it to file will look OK.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM