
Scraping a website by clicking each hyperlink on a page

I am trying to scrape a page that contains ads for places to stay. To get the address of each place, I need to click on its ad, scrape the address from the detail page, then go back and do the same for the next one. This process continues for several pages.

I am using BeautifulSoup for scraping and Selenium for browser access.

import urllib2
import csv
from bs4 import BeautifulSoup
import pandas as pd

import selenium
from selenium import webdriver
import time
from time import sleep

erasmusu_base = "https://erasmusu.com/en/erasmus-barcelona/student-housing?english=1&id=261&p="
hdr= {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent' : "Magic Browser"}

# For the time being lets scrape up to page 2
# This for loop is for moving to the next pages
for i in range(0,2):
    erasmusu = erasmusu_base+str(i)
    page = urllib2.Request(erasmusu, headers=hdr )
    content = urllib2.urlopen(page).read()

    browser = webdriver.Chrome()
    browser.get(erasmusu_base)

    ad = browser.find_element_by_xpath('/html/body/div[1]/div[1]/div[3]/div[3]/div[2]/div/ul/li[1]/div[2]/h3/a')
    ad.click()

First, I am trying to click an ad and open its page so that I can get the price information. Then I will continue this process for the rest of the ads.

You will have to use Selenium to get around the protection on the website.

from selenium import webdriver
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
from time import sleep
import re

erasmusu_base = "https://erasmusu.com/en/erasmus-barcelona/student-housing?english=1&id=261&p="

# For the time being let's scrape up to page 2.
# Reuse one browser instance across pages.
browser = webdriver.Chrome()
for i in range(0, 2):
    # Append the page number so the loop actually advances through the pages.
    browser.get(erasmusu_base + str(i))
    sleep(5)
    # Hide the cookie pop-up if it is present.
    if browser.find_elements(By.CSS_SELECTOR, "a.hide_cookies_panel"):
        browser.find_element(By.CSS_SELECTOR, "a.hide_cookies_panel").click()
    sleep(1)
    # Collect the ad links up front so later navigation does not invalidate the elements.
    hrefs = [a.get_attribute('href') for a in browser.find_elements(By.CSS_SELECTOR, 'h3 a')]
    # Visit each link and scrape it.
    for href in hrefs:
        browser.get(href)
        sleep(5)
        # Use BeautifulSoup to parse the ad page.
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        # Match the address label in English or Spanish, and find the price by class.
        print(soup.find(string=re.compile('^Address|^Dirección')).parent.text,
              soup.find('span', {'class': 'priceflat'}).text)
browser.quit()

Outputs:

Address: Carrer de la Garrotxa, 7, 08041 Barcelona, Spain 450 € / month
Address: Carrer de la Boqueria, 08002 Barcelona, España 800 € / month
Address: Carrer del Dos de Maig, Barcelona, Spain 495 € / month
Address: Carrer de Sant Ferran, 08031 Barcelona, Spain 340 € / month
Dirección: Carrer d'Arenys, 08035 Barcelona, Spain 400 € / mes
...
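Note that the printed price is a display string in either English ("/ month") or Spanish ("/ mes"). If you need a numeric value for sorting or filtering, a small helper (hypothetical, assuming only the formats shown above) can extract the integer amount:

```python
import re

def parse_price(price_text):
    # Match the first integer preceding the euro sign,
    # e.g. "450 € / month" or "400 € / mes".
    m = re.search(r'(\d+)\s*€', price_text)
    return int(m.group(1)) if m else None

print(parse_price("450 € / month"))  # 450
print(parse_price("400 € / mes"))    # 400
```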

If you are on Python 2.7, add this as the first line:

# -*- coding: utf-8 -*-

and change:

re.compile('^Address|^Dirección')

to

re.compile(ur'^Address|^Dirección')

The output will look garbled in a terminal, but if you write it to a file it will look fine.
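Since your question already imports `csv`, one way to persist the results is to write the scraped pairs to a UTF-8 CSV instead of printing them. A minimal Python 3 sketch (the filename and rows here are illustrative, not from the site):

```python
import csv

# Rows in the shape printed above: (address, price) pairs.
rows = [
    ("Address: Carrer de la Garrotxa, 7, 08041 Barcelona, Spain", "450 € / month"),
    ("Dirección: Carrer d'Arenys, 08035 Barcelona, Spain", "400 € / mes"),
]

# An explicit utf-8 encoding keeps accented characters intact;
# newline="" is required by the csv module on Python 3.
with open("listings.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["address", "price"])
    writer.writerows(rows)
```

Opening the resulting file in a spreadsheet (or any UTF-8-aware editor) shows the characters correctly even when the terminal did not.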
