
Scraping a website by clicking each hyperlink on a page

I am trying to scrape a page that lists ads for accommodation. To get the address of each place, I need to click into each ad, scrape its address section, and then go back for the next one. This process continues over several pages.

I am using BeautifulSoup for the scraping and Selenium to drive the browser.

import urllib2
import csv
from bs4 import BeautifulSoup
import pandas as pd

import selenium
from selenium import webdriver
import time
from time import sleep

erasmusu_base = "https://erasmusu.com/en/erasmus-barcelona/student-housing?english=1&id=261&p="
hdr= {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent' : "Magic Browser"}

# For the time being lets scrape up to page 2
# This for loop is for moving to the next pages
for i in range(0,2):
    erasmusu = erasmusu_base+str(i)
    page = urllib2.Request(erasmusu, headers=hdr )
    content = urllib2.urlopen(page).read()

    browser = webdriver.Chrome()
    browser.get(erasmusu_base)

    ad = browser.find_element_by_xpath('/html/body/div[1]/div[1]/div[3]/div[3]/div[2]/div/ul/li[1]/div[2]/h3/a')
    ad.click()

First, I am trying to click an ad and open it in a tab so that I can get the price information. Then I would repeat this process for the rest of the ads.

You will have to use Selenium to get around the protection on the website.

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
import re

erasmusu_base = "https://erasmusu.com/en/erasmus-barcelona/student-housing?english=1&id=261&p="
hdr= {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent' : "Magic Browser"}

# For the time being lets scrape up to page 2
# This for loop is for moving to the next pages
browser = webdriver.Chrome()
for i in range(0, 2):
    browser.get(erasmusu_base)
    sleep(5)
    # Hide cookie pop up.
    if browser.find_elements_by_css_selector("a.hide_cookies_panel"):
        browser.find_element_by_css_selector("a.hide_cookies_panel").click()
    sleep(1)
    # Get a list of links to visit.
    hrefs = [a.get_attribute('href') for a in browser.find_elements_by_css_selector('h3 a')]
    # For each link get link.
    for href in hrefs:
        browser.get(href)
        sleep(5)
        # Use BeautifulSoup to parse address.
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        # Use regex for address text and find price by class
        print(soup.find(text=re.compile('^Address|^Dirección')).parent.text, soup.find('span', {'class':'priceflat'}).text)
browser.quit()

Output:

Address: Carrer de la Garrotxa, 7, 08041 Barcelona, Spain 450 € / month
Address: Carrer de la Boqueria, 08002 Barcelona, España 800 € / month
Address: Carrer del Dos de Maig, Barcelona, Spain 495 € / month
Address: Carrer de Sant Ferran, 08031 Barcelona, Spain 340 € / month
Dirección: Carrer d'Arenys, 08035 Barcelona, Spain 400 € / mes
...
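The regex in the answer matches whichever language the listing happens to use for the address label. A minimal check of that pattern with plain `re` (the sample strings are taken from the output above):

```python
import re

# Pattern from the answer: matches a line starting with the English
# or the Spanish address label.
label = re.compile(u'^Address|^Dirección')

print(bool(label.search(u'Address: Carrer de la Garrotxa, 7')))  # English label
print(bool(label.search(u"Dirección: Carrer d'Arenys")))         # Spanish label
print(bool(label.search(u'450 € / month')))                      # no match
```

Because both alternatives are anchored with `^`, a price line or any text that merely contains the word elsewhere will not match.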

If you are using Python 2.7, add this as the first line:

# -*- coding: utf-8 -*-

and change:

re.compile('^Address|^Dirección')

to:

re.compile(ur'^Address|^Dirección')

It will look terrible in the terminal, but if you write it to a file it will look fine.
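Writing to a file can be sketched like this (the file name `output.txt` is an assumption); `io.open` accepts an `encoding` argument on both Python 2.7 and 3, so the accented characters and the euro sign are stored correctly:

```python
# -*- coding: utf-8 -*-
import io

# A sample scraped line with non-ASCII characters.
lines = [u"Dirección: Carrer d'Arenys, 08035 Barcelona, Spain 400 € / mes"]

# Write explicitly as UTF-8 so the output is readable in any editor.
with io.open('output.txt', 'w', encoding='utf-8') as f:
    for line in lines:
        f.write(line + u'\n')
```

Reading the file back with the same encoding returns the text unchanged.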
