
Scraping a website by clicking each hyperlink on a page

I am trying to scrape a page that lists ads for accommodation. To get the address of each place, I need to click into each ad, scrape its address section, and then go back for the next one. This process continues over several pages.

I am using BeautifulSoup for the scraping and Selenium to drive the browser.

import urllib2
import csv
from bs4 import BeautifulSoup
import pandas as pd

import selenium
from selenium import webdriver
import time
from time import sleep

erasmusu_base = "https://erasmusu.com/en/erasmus-barcelona/student-housing?english=1&id=261&p="
hdr= {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent' : "Magic Browser"}

# For the time being lets scrape up to page 2
# This for loop is for moving to the next pages
for i in range(0,2):
    erasmusu = erasmusu_base+str(i)
    page = urllib2.Request(erasmusu, headers=hdr )
    content = urllib2.urlopen(page).read()

    browser = webdriver.Chrome()
    browser.get(erasmusu_base)

    ad = browser.find_element_by_xpath('/html/body/div[1]/div[1]/div[3]/div[3]/div[2]/div/ul/li[1]/div[2]/h3/a')
    ad.click()

First, I am trying to click an ad and open it in a tab so that I can get the price information. Then I would repeat this process for the rest of the ads.

You will have to use Selenium to get around the protection on the website.

from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
import re

erasmusu_base = "https://erasmusu.com/en/erasmus-barcelona/student-housing?english=1&id=261&p="
hdr= {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'User-Agent' : "Magic Browser"}

# For the time being lets scrape up to page 2
# This for loop is for moving to the next pages
browser = webdriver.Chrome()
for i in range(0, 2):
    browser.get(erasmusu_base)
    sleep(5)
    # Hide cookie pop up.
    if browser.find_elements_by_css_selector("a.hide_cookies_panel"):
        browser.find_element_by_css_selector("a.hide_cookies_panel").click()
    sleep(1)
    # Get a list of links to visit.
    hrefs = [a.get_attribute('href') for a in browser.find_elements_by_css_selector('h3 a')]
    # For each link get link.
    for href in hrefs:
        browser.get(href)
        sleep(5)
        # Use BeautifulSoup to parse address.
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        # Use regex for address text and find price by class
        print(soup.find(text=re.compile('^Address|^Dirección')).parent.text, soup.find('span', {'class':'priceflat'}).text)
browser.quit()

Output:

Address: Carrer de la Garrotxa, 7, 08041 Barcelona, Spain 450 € / month
Address: Carrer de la Boqueria, 08002 Barcelona, España 800 € / month
Address: Carrer del Dos de Maig, Barcelona, Spain 495 € / month
Address: Carrer de Sant Ferran, 08031 Barcelona, Spain 340 € / month
Dirección: Carrer d'Arenys, 08035 Barcelona, Spain 400 € / mes
...
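The regex in the answer matches whichever language the listing happens to use for the address label. A minimal check of that pattern with plain `re` (the sample strings are taken from the output above):

```python
import re

# Pattern from the answer: matches a line starting with the English
# or the Spanish address label.
label = re.compile(u'^Address|^Dirección')

print(bool(label.search(u'Address: Carrer de la Garrotxa, 7')))  # English label
print(bool(label.search(u"Dirección: Carrer d'Arenys")))         # Spanish label
print(bool(label.search(u'450 € / month')))                      # no match
```

Because both alternatives are anchored with `^`, a price line or any text that merely contains the word elsewhere will not match.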

If you are using Python 2.7, add this as the first line:

# -*- coding: utf-8 -*-

and change:

re.compile('^Address|^Dirección')

to:

re.compile(ur'^Address|^Dirección')

It will look terrible in the terminal, but if you write it to a file it will look fine.
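Writing to a file can be sketched like this (the file name `output.txt` is an assumption); `io.open` accepts an `encoding` argument on both Python 2.7 and 3, so the accented characters and the euro sign are stored correctly:

```python
# -*- coding: utf-8 -*-
import io

# A sample scraped line with non-ASCII characters.
lines = [u"Dirección: Carrer d'Arenys, 08035 Barcelona, Spain 400 € / mes"]

# Write explicitly as UTF-8 so the output is readable in any editor.
with io.open('output.txt', 'w', encoding='utf-8') as f:
    for line in lines:
        f.write(line + u'\n')
```

Reading the file back with the same encoding returns the text unchanged.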
