
Find “Span” with specific class with BeautifulSoup

I'm trying to find all elements like this one:

<span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">

I have tried using:

page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
spans = soup.findAll('span', {"class": "BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk"})

print(spans)

URL and headers are declared earlier, but this prints an empty list: []


How can I modify my code?

This one is tricky; here is how I scraped it. The content is generated by JavaScript, so requests alone never receives it: you have to load the page with Selenium, and BeautifulSoup can then parse the rendered HTML. I am using Firefox and geckodriver version 0.24.

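You also have to give the page time to finish loading, so that the source Selenium returns matches what you see in your browser; I do this with an implicit wait. Note that implicitly_wait only sets how long element lookups such as find_element keep retrying, so if the spans are still missing, wait explicitly for them before reading page_source. To understand why the timing matters, read this paragraph: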

"Get the source of the last loaded page. If the page has been modified after loading (for example, by Javascript) there is no guarantee that the returned text is that of the modified page. Please consult the documentation of the particular driver being used to determine whether the returned text reflects the current state of the page or the text last sent by the web server. The page source returned is a representation of the underlying DOM: do not expect it to be formatted or escaped in the same way as the response sent from the web server. Think of it as an artist's impression."

(from the Selenium source code)
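The waiting step can also be made explicit. The following is a minimal sketch, not part of the original answer: it assumes the price spans eventually appear in the DOM and waits for them with Selenium's WebDriverWait before reading the page source. The names css_for_classes, PRICE_SELECTOR and rendered_source are made up for illustration, and the 15-second timeout is an arbitrary choice.

```python
def css_for_classes(tag, class_attr):
    """Turn a tag name and a space-separated class attribute into a CSS selector."""
    return tag + "".join("." + name for name in class_attr.split())


# Class attribute copied from the question.
PRICE_CLASSES = (
    "BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN "
    "BpkText_bpk-text--bold__4yauk"
)
PRICE_SELECTOR = css_for_classes("span", PRICE_CLASSES)


def rendered_source(driver, selector=PRICE_SELECTOR, timeout=15):
    """Block until `selector` matches at least one element, then return the DOM as HTML.

    Raises selenium.common.exceptions.TimeoutException if nothing matches in time.
    """
    # Imported here so that css_for_classes stays usable without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    WebDriverWait(driver, timeout).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, selector))
    )
    return driver.page_source
```

With a driver that has already called driver.get(url), html_selenium = rendered_source(driver) could stand in for the implicit wait followed by driver.page_source. The hashed suffixes in the class names (such as __2NHsO) look build-generated, so the selector may need updating if the site changes.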

Code

import os
import requests
from bs4 import BeautifulSoup
import lxml
from selenium import webdriver

url = 'https://www.skyscanner.it/trasporti/voli/berl/amst/191231/200102/?adultsv2=1&childrenv2=&cabinclass=economy&rtn=1&preferdirects=true&outboundaltsenabled=false&inboundaltsenabled=false&qp_prevProvider=ins_browse&qp_prevCurrency=EUR&priceSourceId=taps-taps&qp_prevPrice=116#/'
driver = webdriver.Firefox(executable_path=r'(put your path here)\geckodriver-v0.24.0-win64\geckodriver.exe')
driver.get(url)
# implicitly_wait sets how long element lookups keep retrying (10 seconds here)
# in my runs, div id='app-root' in page_selenium.txt differs
# with and without this implicit wait
driver.implicitly_wait(10)

#with selenium
html_selenium = driver.page_source
bs_selenium = BeautifulSoup(html_selenium, 'lxml')
with open('page_selenium.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(bs_selenium.prettify())

#with requests
html_req = requests.get(url)
bs_req = BeautifulSoup(html_req.text,'lxml')
with open('page_bs.txt', 'w', encoding='utf-8') as outfile:
    outfile.write(bs_req.prettify())

# open page_selenium.txt and page_bs.txt and compare div id='app-root' in each;
# the difference shows why your method didn't work

# now scrape using the soup built from the Selenium page source
spanner = bs_selenium.find_all('span',{'class':'BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk'})

print(spanner)

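# terminate the browser
# tskill is Windows-only; it kills any leftover Firefox plugin-container process
os.system('tskill plugin-container')
# quit() closes every window and ends the WebDriver session, so close() is not needed
driver.quit()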

Output

[<span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 172</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 99</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 115</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 115</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 115</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 136</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 136</span>, <span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN BpkText_bpk-text--bold__4yauk">€ 136</span>]
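If only the numbers are needed, the spans can be reduced to integers. This is a rough sketch under two assumptions: the labels are whole-euro amounts like those in the output above, and the class names have not changed. SAMPLE_HTML, parse_price and extract_prices are hypothetical names; SAMPLE_HTML stands in for the page source returned by Selenium.

```python
import re

from bs4 import BeautifulSoup

# Two spans copied from the output above, standing in for the rendered page.
SAMPLE_HTML = (
    '<span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN '
    'BpkText_bpk-text--bold__4yauk">€ 99</span>'
    '<span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--lg__3vAKN '
    'BpkText_bpk-text--bold__4yauk">€ 172</span>'
)

# A CSS selector matches the three classes in any order, unlike an exact
# match on the whole class string.
PRICE_SELECTOR = (
    "span.BpkText_bpk-text__2NHsO"
    ".BpkText_bpk-text--lg__3vAKN"
    ".BpkText_bpk-text--bold__4yauk"
)


def parse_price(text):
    """Return the whole-euro amount in a label such as '€ 99'.

    Assumes no cents: every non-digit character is dropped, so a thousands
    separator is ignored. Raises ValueError if the label has no digits.
    """
    digits = re.sub(r"\D", "", text)
    return int(digits)


def extract_prices(html):
    """Parse `html` and return the price of every matching span, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    return [parse_price(span.get_text()) for span in soup.select(PRICE_SELECTOR)]


prices = extract_prices(SAMPLE_HTML)
print(prices)
```

For the two sample spans this yields 99 and 172. On the real page, pass the Selenium page source to extract_prices instead of SAMPLE_HTML.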
