
Selenium Python extracting information from a website and dumping it into JSON Format

I'm trying to open the hotel website www.booking.com and extract the name, price, location, and link of the top 50 search results, sorted by cheapest first. I'm using Selenium with Python to automate the process. However, some HTML elements are targetable while others are not. After inspecting the website, I realized that all hotel names have the class name: fcab3ed991 a23c043802

I tried to target all of them and put them into an array, as seen in my code below, but I can't seem to target the element correctly. What am I doing wrong?

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

PATH= "C:\Program Files (x86)\chromedriver.exe"
driver=webdriver.Chrome(PATH)
driver.get("https://www.booking.com/searchresults.html?label=gen173nr-1FCAEoggI46AdIM1gEaAKIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGoAgO4AvqR75YGwAIB0gIkZDQ4MTdjZDctYzIyNC00N2RlLWJhYjItZDU1YTAwMGU2M2Q12AIF4AIB&sid=8005d0cc6b75af8d0d2e74451b73cb8b&aid=304142&sb=1&sb_lp=1&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1FCAEoggI46AdIM1gEaAKIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGoAgO4AvqR75YGwAIB0gIkZDQ4MTdjZDctYzIyNC00N2RlLWJhYjItZDU1YTAwMGU2M2Q12AIF4AIB%26sid%3D8005d0cc6b75af8d0d2e74451b73cb8b%26sb_price_type%3Dtotal%26%26&ss=Jumeirah%2C+Dubai%2C+Dubai+Emirate%2C+United+Arab+Emirates&is_ski_area=&checkin_year=2022&checkin_month=8&checkin_monthday=1&checkout_year=2022&checkout_month=8&checkout_monthday=3&group_adults=2&group_children=0&no_rooms=1&map=1&b_h4u_keep_filters=&from_sf=1&ss_raw=jum&ac_position=1&ac_langcode=en&ac_click_type=b&dest_id=941&dest_type=district&place_id_lat=25.205553&place_id_lon=55.239216&search_pageview_id=c0ac477da63f02c2&search_pageview_id=c0ac477da63f02c2&search_selected=true&ac_suggestion_list_length=5&ac_suggestion_theme_list_length=0&order=price#map_closed")


try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "d4924c9e74"))
    )

    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, "fcab3ed991 a23c043802"))
    )
    names=element.find_elements_by_class_name("fcab3ed991 a23c043802")
except:
    driver.quit()

To extract the text from the name and price fields, you can use list comprehensions with the following locator strategies:

  • Code block:

     driver.execute("get", {'url': 'https://www.booking.com/searchresults.html?label=gen173nr-1FCAEoggI46AdIM1gEaAKIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGoAgO4AvqR75YGwAIB0gIkZDQ4MTdjZDctYzIyNC00N2RlLWJhYjItZDU1YTAwMGU2M2Q12AIF4AIB&sid=8005d0cc6b75af8d0d2e74451b73cb8b&aid=304142&sb=1&sb_lp=1&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Findex.html%3Flabel%3Dgen173nr-1FCAEoggI46AdIM1gEaAKIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGoAgO4AvqR75YGwAIB0gIkZDQ4MTdjZDctYzIyNC00N2RlLWJhYjItZDU1YTAwMGU2M2Q12AIF4AIB%26sid%3D8005d0cc6b75af8d0d2e74451b73cb8b%26sb_price_type%3Dtotal%26%26&ss=Jumeirah%2C+Dubai%2C+Dubai+Emirate%2C+United+Arab+Emirates&is_ski_area=&checkin_year=2022&checkin_month=8&checkin_monthday=1&checkout_year=2022&checkout_month=8&checkout_monthday=3&group_adults=2&group_children=0&no_rooms=1&map=1&b_h4u_keep_filters=&from_sf=1&ss_raw=jum&ac_position=1&ac_langcode=en&ac_click_type=b&dest_id=941&dest_type=district&place_id_lat=25.205553&place_id_lon=55.239216&search_pageview_id=c0ac477da63f02c2&search_pageview_id=c0ac477da63f02c2&search_selected=true&ac_suggestion_list_length=5&ac_suggestion_theme_list_length=0&order=price#map_closed'})
     names = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[data-testid='title']")))]
     prices = [my_elem.text for my_elem in WebDriverWait(driver, 20).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "div[data-testid='price-and-discounted-price'] > span")))]
     for i,j in zip(names, prices):
         print(f"{i} hotel price is {j}")
  • Note : You have to add the following imports :

     from selenium.webdriver.support.ui import WebDriverWait
     from selenium.webdriver.common.by import By
     from selenium.webdriver.support import expected_conditions as EC
  • Console Output:

     Royal Prestige Hotel hotel price is ₹ 10,871
     Rove La Mer Beach hotel price is ₹ 10,328
     Dubai Marine Beach Resort & Spa hotel price is ₹ 12,133
     Roda Beach Resort hotel price is ₹ 16,525
     Bespoke Residences - 3 Bedroom Waikiki Townhouses hotel price is ₹ 20,395
     Walking distance to Burj al Arab - 1BR Lamtara 2 hotel price is ₹ 16,724
     Mandarin Oriental Jumeira, Dubai hotel price is ₹ 18,108
     Four Seasons Resort Dubai at Jumeirah Beach hotel price is ₹ 20,003
     Bulgari Resort, Dubai hotel price is ₹ 78,274
     Spacious Villa! hotel price is ₹ 62,619
     Palm Beach Hotel hotel price is ₹ 64,794
     York International Hotel hotel price is ₹ 86,971
     Moon , Backpackers , Partition for Couples and for singles hotel price is ₹ 208,731
     Hafez Hotel Apartments Al Ras Metro Station hotel price is ₹ 2,022
     Grand Pearl Hostel For Boys hotel price is ₹ 2,131
     Time Palace Hotel Branch hotel price is ₹ 3,131
     Hostel Youth hotel price is ₹ 3,157
     Grand Mayfair Hotel hotel price is ₹ 3,601
     Explore Old Dubai, Souks, Tastings, Museums hotel price is ₹ 4,592
     Panorama Hotel Bur Dubai hotel price is ₹ 3,674
     Zain International Hotel hotel price is ₹ 3,827
     Panorama Hotel Deira hotel price is ₹ 3,870
     Decent Boys Hostel in center of Bur Dubai next to Burjuman metro Station with all FREE Facilities hotel price is ₹ 3,875
     Brand New Boys Hostel 1 min walk from Burjuman Metro Station EXIT-4 with all Brand New Furnishings & Free Facilities hotel price is ₹ 3,914
     OYO 338 Transworld Hotel hotel price is ₹ 3,914
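
As for why the locator in the question does not match: By.CLASS_NAME expects a single class name, while "fcab3ed991 a23c043802" is two class names separated by a space. If you do want to match on both classes, one option is a CSS selector that chains them. The helper below is a minimal sketch of that conversion; the name compound_class_to_css is made up for illustration and is not part of Selenium. Note also that the find_elements_by_class_name method was removed in newer Selenium 4 releases in favor of driver.find_elements(By..., ...), and that generated class names like these may change whenever the site is redeployed, which is presumably why the locators above use data-testid attributes instead.

```python
def compound_class_to_css(class_attr: str) -> str:
    """Turn a space-separated class attribute value into a CSS selector.

    Each class is prefixed with a dot and the parts are chained without
    spaces, so the selector only matches elements carrying all of them.
    """
    classes = class_attr.split()
    if not classes:
        raise ValueError("expected at least one class name")
    return "." + ".".join(classes)


# Class attribute value copied from the question.
selector = compound_class_to_css("fcab3ed991 a23c043802")
print(selector)

# With a live driver the selector could then be passed to Selenium
# (not executed here, since it needs a browser session):
# names = driver.find_elements(By.CSS_SELECTOR, selector)
```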

PS: Following this solution, you can similarly extract the location and link values and dump everything in JSON format.
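
A rough sketch of the JSON step, assuming you have already collected four parallel lists of strings (names and prices as above, plus locations and links gathered the same way with locators of your own, since none are given here). The function name build_records and the sample location and link values are placeholders, not real scraped data. It uses only the standard json module.

```python
import json
from itertools import islice


def build_records(names, prices, locations, links, limit=50):
    """Combine parallel lists of scraped strings into a list of dicts.

    zip() stops at the shortest list, so a listing with a missing field
    would shift later rows; this sketch assumes the lists line up.
    """
    rows = zip(names, prices, locations, links)
    return [
        {"name": name, "price": price, "location": location, "link": link}
        for name, price, location, link in islice(rows, limit)
    ]


# Placeholder inputs: the name and price come from the console output above,
# the location and link are hypothetical stand-ins for scraped values.
names = ["Royal Prestige Hotel"]
prices = ["₹ 10,871"]
locations = ["Jumeirah, Dubai"]
links = ["https://www.booking.com/hotel/example.html"]

records = build_records(names, prices, locations, links)

# ensure_ascii=False keeps the currency symbol readable instead of escaping it.
payload = json.dumps(records, ensure_ascii=False, indent=2)
print(payload)

# To write a file instead of printing, json.dump(records, fh, ...) can be used
# with a file opened via open("hotels.json", "w", encoding="utf-8").
```

If the lists can get out of step (for example, a listing without a price), it may be more robust to locate each result card first and read the four fields from inside that card.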


 