Extracting HTML content from a search page using Beautiful Soup with Python

I'm trying to get some hotel info from booking.com using Beautiful Soup. I need to get certain info from all the accommodations in Spain. This is the search URL:

https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM&sid=1677838e3fc7c26577ea908d40ad5faf&class_interval=1&dest_id=197&dest_type=country&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&oos_flag=0&postcard=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&src_elem=sb&ss=Spain&ss_all=0&ss_raw=spain&ssb=empty&sshis=0&order=popularity

When I inspect an accommodation on the results page using the developer tools, it shows that this is the tag to search for:

 <a class="hotel_name_link url" href="&#10;/hotel/es/aran-la-abuela.html?label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM;sid=1677838e3fc7c26577ea908d40ad5faf;ucfs=1;srpvid=b4980e34f6e50017;srepoch=1514167274;room1=A%2CA;hpos=1;hapos=1;dest_type=country;dest_id=197;srfid=198499756e07f93263596e1640823813c2ee4fe1X1;from=searchresults&#10;;highlight_room=#hotelTmpl" target="_blank" rel="noopener"> <span class="sr-hotel__name " data-et-click=" customGoal:YPNdKNKNKZJUESUPTOdJDUFYQC:1 "> Hotel Spa Aran La Abuela </span> <span class="invisible_spoken">Opens in new window</span> </a>

This is my Python code:

def init_BeautifulSoup():
    global page, soup
    page= requests.get("https://www.booking.com/searchresults.html?aid=304142&label=gen173nr-1DCAEoggJCAlhYSDNYBGigAYgBAZgBMbgBB8gBDNgBA-gBAfgBApICAXmoAgM&sid=1677838e3fc7c26577ea908d40ad5faf&class_interval=1&dest_id=197&dest_type=country&dtdisc=0&from_sf=1&group_adults=2&group_children=0&inac=0&index_postcard=0&label_click=undef&no_rooms=1&oos_flag=0&postcard=0&raw_dest_type=country&room1=A%2CA&sb_price_type=total&search_selected=1&src_elem=sb&ss=Spain&ss_all=0&ss_raw=spain&ssb=empty&sshis=0&order=popularity")
    soup = BeautifulSoup(page.content, 'html.parser')


def get_spain_accomodations():
    global accomodations
    accomodations = soup.find_all(class_="hotel_name_link.url")

But when I run the code and print the accomodations variable, it outputs a pair of brackets ([]). I then printed the soup object and realized that the parsed HTML is very different from what I see in Chrome's developer tools, which is why the soup object can't find the class "hotel_name_link.url".

What's going on?

JavaScript is modifying the page after it loads. So when you use page.content, you get the HTML of the page as it was before the JS modified it.
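To make the effect concrete, here is a minimal sketch on a made-up page (not booking.com's actual markup): the hotel link is only created by a script, and BeautifulSoup parses the markup without ever executing that script. The HTML string and the "hotellist" id are assumptions chosen purely for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the hotel link does not exist in the markup itself;
# a script would create it in the browser after the page loads.
RAW_HTML = """
<html><body>
  <div id="hotellist"></div>
  <script>
    var link = document.createElement("a");
    link.className = "hotel_name_link url";
    link.textContent = "Hotel Spa Aran La Abuela";
    document.getElementById("hotellist").appendChild(link);
  </script>
</body></html>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")

# BeautifulSoup only parses text; it never runs the <script>,
# so the <a> element is absent from the parsed tree.
links = soup.select("a.hotel_name_link")
script_tags = soup.find_all("script")
print(len(links), len(script_tags))  # prints: 0 1
```

A browser would run the script and show the link in the developer tools, which is the mismatch described in the question.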

You can use selenium to render the JS content. After the page loads, you can use driver.page_source to get the page source after the JS has modified it, and pass it to BeautifulSoup.

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

def get_page(url):
    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait up to 10 seconds for the element to appear ('h1' is only a placeholder locator)
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'h1')))
        return driver.page_source
    except TimeoutException:
        print('Page timed out.')
        return None
    finally:
        # Always close the browser, even after a timeout
        driver.quit()

def init_BeautifulSoup():
    global page, soup
    page = get_page('your-url')
    if page is None:
        # get_page() timed out, so there is nothing to parse
        soup = None
        return
    soup = BeautifulSoup(page, 'html.parser')

EDIT:

You'll need to change one thing here.

The call WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'h1'))) makes the driver wait explicitly until the element you specify is present on the page, and raises TimeoutException if the element has not appeared within the timeout you set (I've used 10 seconds).

I've just provided you with an example. You need to find an element on the loaded page that is not present before the JavaScript executes, and use it in place of (By.TAG_NAME, 'h1').

You can do this by inspecting elements after the page has loaded and checking whether each one also exists in the HTML of the page source.
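That inspection can also be done programmatically. The sketch below assumes you have both the raw response (page.content) and the rendered source (driver.page_source) available as strings, and lists the class names that occur only in the rendered HTML. The helper names class_names and classes_added_by_js are hypothetical, and the two sample documents are stand-ins rather than real booking.com output.

```python
from bs4 import BeautifulSoup

def class_names(html):
    """Collect every CSS class name used anywhere in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    names = set()
    # class_=True matches every tag that has a class attribute at all
    for tag in soup.find_all(class_=True):
        names.update(tag["class"])
    return names

def classes_added_by_js(raw_html, rendered_html):
    """Return class names present in the rendered HTML but not in the raw HTML."""
    return class_names(rendered_html) - class_names(raw_html)

# Tiny stand-in documents (assumed for illustration only).
raw = '<div class="results"></div>'
rendered = '<div class="results"><a class="hotel_name_link url">Hotel</a></div>'
print(sorted(classes_added_by_js(raw, rendered)))  # prints: ['hotel_name_link', 'url']
```

Any class in the result is a candidate for the wait condition, since it only appears once the JavaScript has run.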

Instead of By.TAG_NAME, you can use any of the following according to your requirement: ID, NAME, CLASS_NAME, CSS_SELECTOR, XPATH.
