
Python web-scraping on a multi-layered website without [href]

I am looking for a way to scrape data from the student-accommodation website uniplaces: https://www.uniplaces.com/en/accommodation/berlin .

In the end, I would like to scrape particular information for each property, such as bedroom size, number of roommates, and location. In order to do this, I will first have to scrape all the property links, and then scrape each individual link afterwards.

However, even after going through the console and using BeautifulSoup to extract URLs, I was not able to extract the URLs leading to the separate listings. They don't seem to be included as an [href], and I wasn't able to identify the links in any other format within the HTML code.

This is the Python code I used, but it didn't return anything:

from bs4 import BeautifulSoup
import urllib.request

resp = urllib.request.urlopen("https://www.uniplaces.com/accommodation/lisbon")
soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))

for link in soup.find_all('a', href=True):
    print(link['href'])

So my question is: if the links are not included in http:// format or referenced as [href], is there any way to extract the listing URLs?

I would really highly appreciate any support on this!

All the best, Hannah

If you look at the network tab, you find an API call to this URL: https://www.uniplaces.com/api/search/offers?city=PT-lisbon&limit=24&locale=en_GB&ne=38.79507211908374%2C-9.046124472314432&page=1&sw=38.68769060641113%2C-9.327992453271463

which specifies the location (PT-lisbon) and the northeast (ne) and southwest (sw) corners of the map bounds. From this response, you can get the id of each offer and append it to the listing base URL; you can also get all the information shown on the webpage (price, description, etc.).

For instance:

import requests

resp = requests.get(
    url = 'https://www.uniplaces.com/api/search/offers', 
    params = {
        "city":'PT-lisbon',
        "limit":'24',
        "locale":'en_GB',
        "ne":'38.79507211908374%2C-9.046124472314432',
        "page":'1',
        "sw":'38.68769060641113%2C-9.327992453271463'
    })
body = resp.json()

base_url = 'https://www.uniplaces.com/accommodation/lisbon'

data = [
    (
        t['id'],                  #offer id
        base_url + '/' + t['id'], #this is the offer page
        t['attributes']['accommodation_offer']['title'], 
        t['attributes']['accommodation_offer']['price']['amount'],
        t['attributes']['accommodation_offer']['available_from']
    )
    for t in body['data']
]

print(data)
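If you want more than the first 24 results, you can keep incrementing the `page` parameter until the API returns an empty `data` list. Here is a sketch along those lines; the stopping condition is an assumption (the API may instead expose a total count in its metadata), and the coordinates are simply the Lisbon bounds from the call above:

```python
import requests

BASE_URL = 'https://www.uniplaces.com/accommodation/lisbon'
API_URL = 'https://www.uniplaces.com/api/search/offers'


def offer_url(offer_id):
    """Build the public listing URL for an offer id."""
    return BASE_URL + '/' + offer_id


def fetch_all_offers(city='PT-lisbon',
                     ne='38.79507211908374,-9.046124472314432',
                     sw='38.68769060641113,-9.327992453271463',
                     limit=24):
    """Page through the search API, stopping on the first empty page.

    Assumption: an out-of-range page returns an empty 'data' list.
    """
    offers = []
    page = 1
    while True:
        resp = requests.get(API_URL, params={
            'city': city, 'limit': limit, 'locale': 'en_GB',
            'ne': ne, 'sw': sw, 'page': page,
        })
        resp.raise_for_status()
        data = resp.json().get('data', [])
        if not data:
            break
        offers.extend(data)
        page += 1
    return offers
```

From the accumulated offers you can then build the per-listing URLs with `offer_url(t['id'])` and request each page individually for details such as bedroom size and number of roommates.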
