Python: scrape flight and price data from Skyscanner

I am trying to get the price data from a Skyscanner results page (the URL is in the code below). However, I can only seem to get the text from the divs down to a certain level. Here is my code:

from selenium import webdriver
from bs4 import BeautifulSoup

def scrape_flight_prices(URL):

    browser = webdriver.PhantomJS()
    # PARSE THE HTML
    browser.get(URL)
    soup = BeautifulSoup(browser.page_source, "lxml")
    page_divs = soup.findAll("div", attrs={'id':'app-root'}) 
    for p in page_divs:
        print(p)

if __name__ == '__main__':
    URL1 = "https://www.skyscanner.net/transport/flights/brs/gnb/190216/190223/?adults=1&children=0&adultsv2=1&childrenv2=&infants=0&cabinclass=economy&rtn=1&preferdirects=false&outboundaltsenabled=false&inboundaltsenabled=false&ref=home#results"
    # Call the function so the script actually produces the output shown
    scrape_flight_prices(URL1)

And here is the output:

<div id="app-root">
<section class="day-content state-loading state-no-results" id="daysection">
<div class="day-searching">
<div class="hot-spinner medium"></div>
<div class="day-searching-message">Searching</div>
</div>
</section>
</div>

The section of HTML I want to scrape from is on the results page at this URL:

https://www.skyscanner.net/transport/flights/brs/gnb/190216/190223/?adults=1&children=0&adultsv2=1&childrenv2=&infants=0&cabinclass=economy&rtn=1&preferdirects=false&outboundaltsenabled=false&inboundaltsenabled=false&ref=home#results

However, when I try to scrape it with the following code:

prices = soup.findAll("a", attrs={'target':"_blank", "data-e2e":"itinerary-price", "class":"CTASection__price-2bc7h price"})  
for p in prices:
    print(p)

It prints nothing! I suspect a JavaScript script is generating the rest of the HTML and/or data. Can anyone help me extract the data? Specifically, I am trying to get the price, flight times, airline name, etc., but if Beautiful Soup is not seeing the relevant HTML from the page, I'm not sure how else to get it.

Would appreciate any pointers! Many thanks in advance!

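Try the code below to get the prices. It uses an explicit wait of up to 10 seconds for elements with the class "price" to be present, since the page is still in its "Searching" state when page_source is read immediately after browser.get():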

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC

prices = [price.text for price in wait(browser, 10).until(EC.presence_of_all_elements_located((By.CLASS_NAME, "price")))]
print(prices)
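
As a rough sketch, the wait above could be wrapped in a helper and paired with a small function that turns the scraped strings into numbers. The names wait_for_price_texts and parse_price are hypothetical, and the price format (a currency symbol, commas as thousands separators, an optional decimal part, e.g. "£1,234") is an assumption about what the page shows, not something confirmed by the question.

```python
import re
from decimal import Decimal


def wait_for_price_texts(browser, timeout=10):
    """Return the text of every element with class "price" once any are present."""
    # Imported here so parse_price() below can be used without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Raises selenium.common.exceptions.TimeoutException if none appear in time.
    elements = WebDriverWait(browser, timeout).until(
        EC.presence_of_all_elements_located((By.CLASS_NAME, "price"))
    )
    return [element.text for element in elements]


def parse_price(text):
    """Convert a string such as "£1,234" to a Decimal; return None if no number is found.

    Assumes commas are thousands separators (hypothetical format).
    """
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return Decimal(match.group().replace(",", ""))
```

With the browser object from the question already on the results page, this might be used as: prices = [parse_price(t) for t in wait_for_price_texts(browser)]. Entries that contain no digits come back as None and can be filtered out.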


 