Python: Scraping multiple classes within a page using BeautifulSoup

Question

I am trying to scrape Agoda's daily hotel prices for multiple room types, along with additional information such as promotions, breakfast conditions, and book-now-pay-later terms.

My code is below:

import time
import requests
from bs4 import BeautifulSoup

url = "http://www.agoda.com/ambassador-hotel-taipei/hotel/taipei-tw.html?asq=8m91A1C3D%252bTr%252bvRSmuClW5dm5vJXWO5dlQmHx%252fdU9qxilNob5hJg0b218wml6rCgncYsXBK0nWktmYtQJCEMu0P07Y3BjaTYhdrZvavpUnmfy3moWn%252bv8f2Lfx7HovrV95j6mrlCfGou99kE%252bA0aX0aof09AStNs69qUxvAVo53D4ZTrmAxm3bVkqZJr62cU&tyra=1%257c2&searchrequestid=2e2b0e8c-cadb-465b-8dea-2222e24a1678&pingnumber=1&checkin=2015-10-01&los=1"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
n = len(soup.select('.room-name'))

for i in range(0, n):
    en_room = soup.select('.room-name')[i].text.strip()
    currency = soup.select('.currency')[i].text
    price = soup.select('.sellprice')[i].text

    try:
        sp_info = soup.select('.left-room-info')[i].text.strip()
    except Exception as e:
        sp_info = "N/A"

    try:
        pay_later = soup.select('.book-now-paylater')[i].text.strip()
    except Exception as e:
        pay_later = "N/A"


    print(i + 1, en_room, currency, price, sp_info, pay_later)
    time.sleep(1)

I have two questions:

(1) The "left-room-info" class seems to contain two sub-classes, "breakfast" and "room-promo". These sub-classes only appear when the particular room type offers those services.

When only one of the sub-classes appears, the output is fine. However, when neither appears, the output is empty where I expect "N/A". And when both appear, the output contains unnecessary empty lines that .strip() does not remove.

Is there any way to solve these problems?

(2) When I extract information from the '.book-now-paylater' class, the extracted data does not line up with the room types. For example, assuming there are 10 room types and only rooms 2, 4, 6, and 8 allow travelers to book now and pay later, the code extracts exactly 4 pieces of book-now-pay-later information, but they are then wrongly assigned to room types 1, 2, 3, and 4.

Is there any way to fix this problem?

Thank you for your help!

Gary

Answer 1

(1) This happens because selecting '.left-room-info' does not raise an exception even when the element contains no text, so your except block never runs. Instead, check whether the value is an empty string (''). A simple "if not string_var" test does this:

sp_info = soup.select('.left-room-info')[i].text.strip()
if not sp_info:
    sp_info = "N/A"

When both sub-classes show up, split the string on the carriage return ('\r') and then strip each of the resulting pieces. The code would look something like this (note that sp_info is now a list, not just a string):

sp_info = soup.select('.left-room-info')[i].text.strip().split('\r')
if len(sp_info) > 1:
    sp_info = [ info.strip() for info in sp_info ]

Putting these pieces together, we get something like this:

sp_info = soup.select('.left-room-info')[i].text.strip().split('\r')
if len(sp_info) > 1:
    sp_info = [ info.strip() for info in sp_info ]
elif not sp_info[0]: # check for empty string
    sp_info = ["N/A"] # keep sp_info a list for consistency
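
One caveat with splitting on '\r': if the page uses '\n' or '\r\n' line endings instead, the pieces may not separate cleanly, or blank entries may be left behind. Below is a minimal sketch of a more forgiving helper that makes no assumption about the line endings. The name clean_info and its default argument are hypothetical and not part of the answer above.

```python
def clean_info(raw_text, default="N/A"):
    """Split raw element text into non-empty, stripped lines.

    Hypothetical helper: returns [default] when nothing is left.
    """
    # str.splitlines() treats CR, LF and CRLF as line boundaries
    pieces = [line.strip() for line in raw_text.splitlines()]
    pieces = [piece for piece in pieces if piece]  # drop blank lines
    return pieces or [default]


print(clean_info("  Breakfast included\r\n\r\n   Free upgrade  "))
print(clean_info("   "))
```

With this helper, the three-branch if/elif above could be replaced by a single call such as clean_info(soup.select('.left-room-info')[i].text), and the result is always a list.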

(2) is a little more complicated. You will have to change how you parse the page: namely, you will probably have to select on .room-type first. The way you currently select the book-now-pay-later elements does not associate them with any other element; it just selects the 8 instances of that class. Here is how I would go about it:

import requests
from bs4 import BeautifulSoup

url = "http://www.agoda.com/ambassador-hotel-taipei/hotel/taipei-tw.html?asq=8m91A1C3D%252bTr%252bvRSmuClW5dm5vJXWO5dlQmHx%252fdU9qxilNob5hJg0b218wml6rCgncYsXBK0nWktmYtQJCEMu0P07Y3BjaTYhdrZvavpUnmfy3moWn%252bv8f2Lfx7HovrV95j6mrlCfGou99kE%252bA0aX0aof09AStNs69qUxvAVo53D4ZTrmAxm3bVkqZJr62cU&tyra=1%257c2&searchrequestid=2e2b0e8c-cadb-465b-8dea-2222e24a1678&pingnumber=1&checkin=2015-10-01&los=1"
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser') # name the parser explicitly to avoid a warning

rooms = soup.select('.room-type')[1:] # the first instance of the class isn't a room

room_list = []

for room in rooms:
    room_info = {}

    room_info['en_room'] = room.select('.room-name')[0].text.strip()
    room_info['currency'] = room.select('.currency')[0].text.strip()
    room_info['price'] = room.select('.sellprice')[0].text.strip()

    sp_info = room.select('.left-room-info')[0].text.strip().split('\r')
    # strip every piece and drop empty ones, so the result is always a string
    sp_info = [ info.strip() for info in sp_info if info.strip() ]
    room_info['sp_info'] = ", ".join(sp_info) if sp_info else "N/A"

    pay_later = room.select('.book-now-paylater')
    room_info['pay_later'] = pay_later[0].text.strip() if pay_later else "N/A"

    room_list.append(room_info)
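
To see why scoping the lookup to each room fixes the misalignment, here is a small self-contained sketch. The HTML is invented for illustration and only borrows the class names mentioned in the question; the real Agoda markup will differ. It uses select_one, which returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Invented markup for illustration only; the real Agoda page will differ.
SAMPLE_HTML = """
<table>
  <tr class="room-type">
    <td><div class="room-name">Standard</div></td>
  </tr>
  <tr class="room-type">
    <td><div class="room-name">Deluxe</div>
        <div class="book-now-paylater">Book now, pay later</div></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Page-wide selection: one hit, with no record of which room it belongs to.
page_wide = [tag.text.strip() for tag in soup.select(".book-now-paylater")]

# Per-room selection: search inside each row so the value stays with its room.
per_room = []
for room in soup.select(".room-type"):
    name = room.select_one(".room-name").text.strip()
    pay_later = room.select_one(".book-now-paylater")  # None when absent
    per_room.append((name, pay_later.text.strip() if pay_later else "N/A"))

print(page_wide)
print(per_room)
```

Indexing page_wide by room position would attach the single hit to the first room, which is the mismatch described in the question; the per-room loop keeps it with the second room.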

Answer 2

In your code, you are not traversing the DOM correctly, which causes problems when scraping (for example, your second problem). Here is a suggestive code snippet (not an exact solution), in the hope that you can solve the first problem yourself.

# select all room types via the table's tr tags
room_types = soup.find_all('tr', class_="room-type")

# iterate over the list to scrape data from each td or div inside the tr
for room in room_types:
    en_room = room.find('div', class_='room-name').text.strip()
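
One thing to watch with this approach: find() returns None when no element matches, so calling .text on the result raises AttributeError for rooms that lack an optional element. A minimal sketch of a guard follows; the helper name find_text and the sample markup are hypothetical.

```python
from bs4 import BeautifulSoup


def find_text(parent, tag_name, css_class, default="N/A"):
    """Return stripped text of the first matching child, or default if absent or empty."""
    node = parent.find(tag_name, class_=css_class)
    if node is None:
        return default
    return node.get_text(strip=True) or default


# Invented one-row table, used only to exercise the helper.
html = '<table><tr class="room-type"><td><div class="room-name"> Twin </div></td></tr></table>'
row = BeautifulSoup(html, "html.parser").find("tr", class_="room-type")

print(find_text(row, "div", "room-name"))
print(find_text(row, "div", "breakfast"))
```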

Answer 1 (score 2, accepted): 2015-09-18 17:42:05

Answer 2 (score 1): 2015-09-18 15:55:21
