Python - Crawl Multiple Classes within a Page Using BeautifulSoup

I am trying to scrape Agoda's daily hotel prices for multiple room types, along with additional information such as promotions, breakfast conditions, and book-now-pay-later terms.

My code is below:

import time  # needed for time.sleep() in the loop below
import requests
from bs4 import BeautifulSoup

url = "http://www.agoda.com/ambassador-hotel-taipei/hotel/taipei-tw.html?asq=8m91A1C3D%252bTr%252bvRSmuClW5dm5vJXWO5dlQmHx%252fdU9qxilNob5hJg0b218wml6rCgncYsXBK0nWktmYtQJCEMu0P07Y3BjaTYhdrZvavpUnmfy3moWn%252bv8f2Lfx7HovrV95j6mrlCfGou99kE%252bA0aX0aof09AStNs69qUxvAVo53D4ZTrmAxm3bVkqZJr62cU&tyra=1%257c2&searchrequestid=2e2b0e8c-cadb-465b-8dea-2222e24a1678&pingnumber=1&checkin=2015-10-01&los=1"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
n = len(soup.select('.room-name'))

for i in range(0, n):
    en_room = soup.select('.room-name')[i].text.strip()
    currency = soup.select('.currency')[i].text
    price = soup.select('.sellprice')[i].text

    try:
        sp_info = soup.select('.left-room-info')[i].text.strip()
    except Exception as e:
        sp_info = "N/A"

    try:
        pay_later = soup.select('.book-now-paylater')[i].text.strip()
    except Exception as e:
        pay_later = "N/A"


    print(en_room, i + 1, currency, price, sp_info, pay_later)
    time.sleep(1)

I have two questions:

(1) The "left-room-info" class seems to contain two sub-classes, "breakfast" and "room-promo". These sub-classes only appear when the particular room type offers such services.

When only one of the sub-classes appears, the output works well. However, when neither appears, the output is empty, whereas I expect it to show "N/A". And when both appear, the output contains unnecessary empty lines that .strip() does not remove.

Is there any way to solve these problems?

(2) When I extract information from the class '.book-now-paylater', the extracted data does not line up with the room types. For example, assuming there are 10 room types and only rooms 2, 4, 6, and 8 allow travelers to book now and pay later, the code extracts exactly 4 pieces of book-now-pay-later information, but these are then wrongly assigned to room types 1, 2, 3, and 4.

Is there any way to fix this problem?

Thank you for your help!

Gary

(1) This happens because an empty '.left-room-info' element does not raise an exception, so your except block never runs. Instead, check whether the value is an empty string (''). A simple "if not string_var" test does this:

sp_info = soup.select('.left-room-info')[i].text.strip()
if not sp_info:
    sp_info = "N/A"

When both subclasses show up, split the string on the carriage return ('\r') and then strip each of the resulting pieces. The code would look something like this (note that sp_info is now a list, not just a string):

sp_info = soup.select('.left-room-info')[i].text.strip().split('\r')
if len(sp_info) > 1:
    sp_info = [ info.strip() for info in sp_info ]

Putting these pieces together, we get something like this:

sp_info = soup.select('.left-room-info')[i].text.strip().split('\r')
if len(sp_info) > 1:
    sp_info = [ info.strip() for info in sp_info ]
elif not sp_info[0]: # check for empty string
    sp_info = ["N/A"] # keep sp_info a list for consistency
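
The same cleanup can be wrapped in a small helper. This is only a sketch, not part of the original answer: the name clean_info is hypothetical, and it assumes the pieces may be separated by any mix of '\r' and '\n' rather than only '\r', which str.splitlines() handles. It also drops pieces that are empty after stripping.

```python
def clean_info(raw_text, default="N/A"):
    """Split raw element text into stripped, non-empty pieces.

    Always returns a list; the list holds only `default` when the
    text contains nothing but whitespace.
    """
    # splitlines() breaks on '\r', '\n', and '\r\n' alike
    pieces = [piece.strip() for piece in raw_text.splitlines()]
    # Drop the blank entries left behind by consecutive line breaks
    pieces = [piece for piece in pieces if piece]
    return pieces if pieces else [default]


print(clean_info("  Breakfast included \r\n\r\n  Free cancellation  "))
print(clean_info("   \r\n  "))
```

Inside the loop it would be called as clean_info(soup.select('.left-room-info')[i].text), which keeps the empty case and the two-subclass case on one code path.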

(2) is a little more complicated. You will have to change how you parse the page; namely, you will probably have to select on .room-type first. The way you currently select the book-now-pay-later elements does not associate them with any other element; it just selects the 8 instances of that class. Here is how I would go about it:

import requests
from bs4 import BeautifulSoup

url = "http://www.agoda.com/ambassador-hotel-taipei/hotel/taipei-tw.html?asq=8m91A1C3D%252bTr%252bvRSmuClW5dm5vJXWO5dlQmHx%252fdU9qxilNob5hJg0b218wml6rCgncYsXBK0nWktmYtQJCEMu0P07Y3BjaTYhdrZvavpUnmfy3moWn%252bv8f2Lfx7HovrV95j6mrlCfGou99kE%252bA0aX0aof09AStNs69qUxvAVo53D4ZTrmAxm3bVkqZJr62cU&tyra=1%257c2&searchrequestid=2e2b0e8c-cadb-465b-8dea-2222e24a1678&pingnumber=1&checkin=2015-10-01&los=1"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')  # name the parser explicitly, as in the question

rooms = soup.select('.room-type')[1:] # the first instance of the class isn't a room

room_list = []

for room in rooms:
    room_info = {}

    room_info['en_room'] = room.select('.room-name')[0].text.strip()
    room_info['currency'] = room.select('.currency')[0].text.strip()
    room_info['price'] = room.select('.sellprice')[0].text.strip()

    # Split on carriage returns, strip each piece, and drop the empty ones
    pieces = room.select('.left-room-info')[0].text.split('\r')
    pieces = [piece.strip() for piece in pieces if piece.strip()]
    # Always store a string: the joined pieces, or "N/A" when nothing is left
    room_info['sp_info'] = ", ".join(pieces) if pieces else "N/A"

    pay_later = room.select('.book-now-paylater')
    room_info['pay_later'] = pay_later[0].text.strip() if pay_later else "N/A"

    room_list.append(room_info)
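
To see why selecting inside each room fixes the misalignment, here is a small self-contained sketch. The markup is invented for illustration (the real Agoda page is certainly more complex); only the class names are taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: three rooms, only the second has a pay-later note
SAMPLE_HTML = """
<table>
  <tr class="room-type"><td class="room-name">Standard</td></tr>
  <tr class="room-type"><td class="room-name">Deluxe</td>
      <td class="book-now-paylater">Book now, pay later</td></tr>
  <tr class="room-type"><td class="room-name">Suite</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Page-wide selection: the notes come back as a flat list, so index i
# no longer says which room a note belongs to
page_wide = soup.select(".book-now-paylater")

# Per-room selection: each note is looked up inside its own row
per_room = []
for room in soup.select(".room-type"):
    name = room.select(".room-name")[0].text.strip()
    note = room.select(".book-now-paylater")
    per_room.append((name, note[0].text.strip() if note else "N/A"))

print(len(page_wide))
for name, note in per_room:
    print(name, "|", note)
```

The page-wide list has a single entry at index 0, which the original loop would have attached to the first room; the per-room lookup attaches it to the second room and fills the others with "N/A".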

In your code, you are not traversing the DOM correctly, which causes scraping problems such as your second one. Here is a suggestive code snippet (not an exact solution), in the hope that you can solve the first problem yourself.

# select all room types by the table's tr tags
room_types = soup.find_all('tr', class_="room-type")

# iterate over the list to scrape data from each td or div inside the tr
for room in room_types:
    en_room = room.find('div', class_='room-name').text.strip()
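
One thing to watch with this approach: find() returns None when nothing matches, so calling .text on the result raises AttributeError for rooms that lack the element. A possible guard is sketched below. The helper name text_or_default and the sample markup are assumptions made for illustration; only the class names come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second room has no left-room-info block
SAMPLE_HTML = """
<table>
  <tr class="room-type">
    <td><div class="room-name"> Superior Room </div></td>
    <td><div class="left-room-info"><span class="breakfast">Breakfast included</span></div></td>
  </tr>
  <tr class="room-type">
    <td><div class="room-name"> Executive Room </div></td>
  </tr>
</table>
"""


def text_or_default(parent, tag, css_class, default="N/A"):
    """Return the stripped text of the first match, or `default`
    when the element is missing or contains no text."""
    node = parent.find(tag, class_=css_class)
    if node is None:
        return default
    return node.get_text(strip=True) or default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

results = []
for room in soup.find_all("tr", class_="room-type"):
    results.append({
        "en_room": text_or_default(room, "div", "room-name"),
        "sp_info": text_or_default(room, "div", "left-room-info"),
    })

for row in results:
    print(row["en_room"], "|", row["sp_info"])
```

Because the lookup is scoped to each tr, a missing element yields "N/A" for that room only and never shifts the values of the rooms after it.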
