
Why am I not able to get BeautifulSoup to work as described?

I am pretty new to Beautiful Soup, so I am willing to accept that I am probably doing something silly. Nevertheless, after reading the documentation and following about four different online tutorials, I am not getting the results I expect. First, let me explain the use case.

The objective is to run a search against a holiday home website (in this case Stayz) with a fixed set of criteria, changing only the dates, so that I can work out when I will get the best value for a holiday. I want to store all the returned results in a database for further analysis.

But the first step is being able to capture the results. So here is my code.

# Beautiful Soup and URL libraries
from bs4 import BeautifulSoup
import urllib.request

# Define & request the url that we want to scrape
url = r"https://www.stayz.com.au/search/keywords:warrnambool-victoria-australia/arrival:2020-10-23/departure:2020-10-25/minBedrooms/3?petIncluded=false"
html_content = urllib.request.urlopen(url)

# Pass the html_content(the webpage) through our beautiful soup object
soup = BeautifulSoup(html_content, 'html.parser')

So far so good: this returns a copy of the expected page... I think! Now I want to find a specific section of the web page. Here is the relevant section of HTML:

 <div class="media-flex__body"> <h2 class="HitInfo__headline hover-text" aria-hidden="true">Merri Beach House - Opposite Beach with spectacular Views &amp; Free Wi Fi</h2> <span class="sr-only">Property 1: Merri Beach House - Opposite Beach with spectacular Views &amp; Free Wi Fi</span> <div class="HitInfo__details"> <div class="Details__propertyType Details__label" aria-hidden="true">House</div> <div class="Details__bedrooms Details__label" aria-hidden="true">4 BR</div> <div class="Details__bathrooms Details__label" aria-hidden="true">2 BA</div> <div class="Details__sleeps Details__label" aria-hidden="true">Sleeps 9</div> <div class="Details__label" aria-hidden="true">5 m<sup>2</sup></div> <div class="sr-only"><span>Property TypeHouse</span><span>4Bedrooms</span><span>2Bathrooms</span><span>9Sleeps</span><span>5Square Meters</span></div> </div> <div class="GeoDistance"> <svg xmlns="http://www.w3.org/2000/svg" class="GeoDistance__icon" width="16" height="16" viewBox="0 0 16 16"> <g fill="none" fill-rule="evenodd" stroke="#5E6D77" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5"> <path class="GeoDistance__iconPinPath fill-transparent stroke-currentColor" d="M3.95 9.113a5.11 5.11 0 0 1 .546-6.579l.038-.038a5.11 5.11 0 0 1 7.226 0l.037.038a5.11 5.11 0 0 1 .548 6.58L8.147 15 3.95 9.113z"></path> <path class="GeoDistance__iconPinHole fill-transparent stroke-currentColor" d="M9.84 6.146a1.692 1.692 0 1 1-3.387 0 1.694 1.694 0 0 1 3.387 0z"></path> </g> </svg> <span class="GeoDistance__text">12 min. walk to the beach</span> </div> </div>

So from what I have read, I should be able to perform the following search, and this should give me what I need:

initial_search = soup.find_all('div', class_="media-flex__body")

However, I get no results returned.

I have also tried going further up the tree and searching for class="HitCollection", which should return all the results if I am understanding things correctly. This does return a result, but it looks like a placeholder rather than the actual result.

This makes me wonder if I need to use a different method to grab search results as opposed to what I would do if I was scraping a static page.

The results from my second search attempt are below. I am not very experienced with web page design, so perhaps the cause is obvious to those of you who are. I greatly appreciate any assistance.

 <div class="HitCollection HitCollection--placeholder"> <div aria-busy="true" class="Hit media-flex media-flex--left media-flex--xs" data-wdio="hit-placeholder"> <div class="LoadingPlaceholder thumbnail--noMargin media-flex__figure Hit__loadingThumbnail"><div class="LoadingPlaceholder__inner"></div></div> <div class="Hit__infoPlaceholder--mobile" data-wdio="HitPlaceholder"> <div class="LoadingPlaceholder Hit__pricePlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__reviewsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> </div> <div class="Hit__infoPlaceholder--desktop" data-wdio="HitPlaceholder"> <div class="LoadingPlaceholder Hit__urgencyPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__headlinePlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__infoBarPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> </div> </div> <div aria-busy="true" class="Hit media-flex media-flex--left media-flex--xs" data-wdio="hit-placeholder"> <div class="LoadingPlaceholder thumbnail--noMargin media-flex__figure Hit__loadingThumbnail"><div class="LoadingPlaceholder__inner"></div></div> <div class="Hit__infoPlaceholder--mobile" data-wdio="HitPlaceholder"> <div class="LoadingPlaceholder Hit__pricePlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__reviewsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> </div> <div class="Hit__infoPlaceholder--desktop" data-wdio="HitPlaceholder"> <div class="LoadingPlaceholder 
Hit__urgencyPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__headlinePlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__infoBarPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> </div> </div> <div aria-busy="true" class="Hit media-flex media-flex--left media-flex--xs" data-wdio="hit-placeholder"> <div class="LoadingPlaceholder thumbnail--noMargin media-flex__figure Hit__loadingThumbnail"><div class="LoadingPlaceholder__inner"></div></div> <div class="Hit__infoPlaceholder--mobile" data-wdio="HitPlaceholder"> <div class="LoadingPlaceholder Hit__pricePlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__reviewsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> </div> <div class="Hit__infoPlaceholder--desktop" data-wdio="HitPlaceholder"> <div class="LoadingPlaceholder Hit__urgencyPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__headlinePlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__detailsPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> <div class="LoadingPlaceholder Hit__infoBarPlaceholder"><div class="LoadingPlaceholder__inner"></div></div> </div> </div> </div>

This should help you:

from bs4 import BeautifulSoup

html = '<div class="media-flex__body"><h2 class="HitInfo__headline hover-text" aria-hidden="true">Merri Beach House - Opposite Beach with spectacular Views &amp; Free Wi Fi</h2><span class="sr-only">Property 1: Merri Beach House - Opposite Beach with spectacular Views &amp; Free Wi Fi</span><div class="HitInfo__details"><div class="Details__propertyType Details__label" aria-hidden="true">House</div><div class="Details__bedrooms Details__label" aria-hidden="true">4 BR</div><div class="Details__bathrooms Details__label" aria-hidden="true">2 BA</div><div class="Details__sleeps Details__label" aria-hidden="true">Sleeps 9</div><div class="Details__label" aria-hidden="true">5 m<sup>2</sup></div><div class="sr-only"><span>Property TypeHouse</span><span>4Bedrooms</span><span>2Bathrooms</span><span>9Sleeps</span><span>5Square Meters</span></div></div><div class="GeoDistance"><svg xmlns="http://www.w3.org/2000/svg" class="GeoDistance__icon" width="16" height="16" viewBox="0 0 16 16"><g fill="none" fill-rule="evenodd" stroke="#5E6D77" stroke-linecap="round" stroke-linejoin="round" stroke-width="1.5"><path class="GeoDistance__iconPinPath fill-transparent stroke-currentColor" d="M3.95 9.113a5.11 5.11 0 0 1 .546-6.579l.038-.038a5.11 5.11 0 0 1 7.226 0l.037.038a5.11 5.11 0 0 1 .548 6.58L8.147 15 3.95 9.113z"></path><path class="GeoDistance__iconPinHole fill-transparent stroke-currentColor" d="M9.84 6.146a1.692 1.692 0 1 1-3.387 0 1.694 1.694 0 0 1 3.387 0z"></path></g></svg><span class="GeoDistance__text">12 min. walk to the beach</span></div></div>'

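# html5lib is a third-party parser and must be installed separately
soup = BeautifulSoup(html, 'html5lib')

# find() returns the first matching tag
div = soup.find('div', class_="media-flex__body")

print(div.h2.text)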

Output:

Merri Beach House - Opposite Beach with spectacular Views & Free Wi Fi

If you want to access the h2 tag directly, use this:

h2 = soup.find('h2',class_ = "HitInfo__headline hover-text")

print(h2.text)

Output:

Merri Beach House - Opposite Beach with spectacular Views & Free Wi Fi
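
Since the goal is to store every result, the same idea extends to the other fields in each block. Here is a rough sketch, not a definitive implementation: the helper names text_of and parse_listing are made up for illustration, the HTML is a trimmed copy of the block in the question, and the class names are assumed to stay as shown there.

```python
from bs4 import BeautifulSoup

# Trimmed copy of one result block from the question (SVG icon omitted).
html = """
<div class="media-flex__body">
  <h2 class="HitInfo__headline hover-text" aria-hidden="true">Merri Beach House - Opposite Beach with spectacular Views &amp; Free Wi Fi</h2>
  <div class="HitInfo__details">
    <div class="Details__propertyType Details__label" aria-hidden="true">House</div>
    <div class="Details__bedrooms Details__label" aria-hidden="true">4 BR</div>
    <div class="Details__bathrooms Details__label" aria-hidden="true">2 BA</div>
    <div class="Details__sleeps Details__label" aria-hidden="true">Sleeps 9</div>
  </div>
  <div class="GeoDistance"><span class="GeoDistance__text">12 min. walk to the beach</span></div>
</div>
"""


def text_of(block, selector):
    """Return the stripped text of the first match for `selector`, or None."""
    tag = block.select_one(selector)
    return tag.get_text(strip=True) if tag else None


def parse_listing(block):
    """Collect the fields of one result block into a plain dict (hypothetical helper)."""
    return {
        "title": text_of(block, "h2.HitInfo__headline"),
        "property_type": text_of(block, "div.Details__propertyType"),
        "bedrooms": text_of(block, "div.Details__bedrooms"),
        "bathrooms": text_of(block, "div.Details__bathrooms"),
        "sleeps": text_of(block, "div.Details__sleeps"),
        "distance": text_of(block, "span.GeoDistance__text"),
    }


soup = BeautifulSoup(html, "html.parser")
# find_all() returns every result block, so this scales to a full results page.
listings = [parse_listing(b) for b in soup.find_all("div", class_="media-flex__body")]
print(listings[0])
```

Each dict maps directly onto one database row. A field that is missing from a block comes back as None instead of raising an error.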

I also recommend using Selenium instead of urllib to get the HTML, because the page loads its results dynamically, like this:

from selenium import webdriver
driver = webdriver.Chrome()
driver.get(url)
html_content = driver.page_source
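
A quick way to confirm that the listings are filled in by JavaScript is to check what a given HTML string actually contains. This is only a sketch under assumptions: looks_unrendered is a hypothetical helper, the class names are taken from the question, and the two sample strings are shortened stand-ins rather than real responses from the site.

```python
from bs4 import BeautifulSoup


def looks_unrendered(html):
    """Return True if the HTML has loading placeholders but no real result blocks."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any single class of a tag that carries several classes.
    placeholders = soup.find_all("div", class_="HitCollection--placeholder")
    results = soup.find_all("div", class_="media-flex__body")
    return bool(placeholders) and not results


# Shortened stand-in for what urllib returned in the question.
static_html = (
    '<div class="HitCollection HitCollection--placeholder">'
    '<div aria-busy="true" class="Hit media-flex" data-wdio="hit-placeholder"></div>'
    '</div>'
)
# Made-up stand-in for the page after the browser has rendered it.
rendered_html = (
    '<div class="HitCollection">'
    '<div class="media-flex__body"><h2>Merri Beach House</h2></div>'
    '</div>'
)

print(looks_unrendered(static_html))    # True
print(looks_unrendered(rendered_html))  # False
```

If the check returns True for the HTML you fetched, the results were not in that response, and no parser setting will make find_all return them.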

Also change your parser from html.parser to lxml. Here is the final code to extract the first title on the page:

from bs4 import BeautifulSoup
from selenium import webdriver
import time

# Define & request the url that we want to scrape
url = r"https://www.stayz.com.au/search/keywords:warrnambool-victoria-australia/arrival:2020-10-23/departure:2020-10-25/minBedrooms/3?petIncluded=false"

driver = webdriver.Chrome()
driver.get(url)
# Fixed pause so the JavaScript can render the results; may need tuning
time.sleep(3)
html_content = driver.page_source
# The lxml parser needs the third-party lxml package
soup = BeautifulSoup(html_content, 'lxml')
# quit() ends the whole browser session, not just the current window
driver.quit()

# First result block on the page
div = soup.find('div', class_="media-flex__body")

print(div.h2.text)

Output:

Merri Beach House - Opposite Beach with spectacular Views & Free Wi Fi
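
For the original goal of comparing different dates, the search URL can be generated for each stay and then fed to the code above. This is a minimal sketch that assumes the URL pattern from the question stays valid for other dates. build_search_url and weekly_arrivals are hypothetical helper names, not part of any library.

```python
from datetime import date, timedelta

# URL pattern copied from the question; assumed to work for other dates too.
SEARCH_URL = (
    "https://www.stayz.com.au/search/keywords:warrnambool-victoria-australia"
    "/arrival:{arrival}/departure:{departure}/minBedrooms/3?petIncluded=false"
)


def build_search_url(arrival, nights):
    """Return the search URL for a stay starting on `arrival` lasting `nights` nights."""
    departure = arrival + timedelta(days=nights)
    return SEARCH_URL.format(
        arrival=arrival.isoformat(), departure=departure.isoformat()
    )


def weekly_arrivals(first, weeks):
    """Yield `weeks` arrival dates, one per week, starting at `first`."""
    for i in range(weeks):
        yield first + timedelta(weeks=i)


# Three consecutive two-night stays, starting with the dates in the question.
urls = [
    build_search_url(d, nights=2)
    for d in weekly_arrivals(date(2020, 10, 23), weeks=3)
]
for u in urls:
    print(u)
```

Each URL can then be passed to driver.get() in turn, ideally with a pause between requests so the site is not hammered.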


 