简体   繁体   中英

BeautifulSoup is not parsing html correctly on terminal, but works in my Jupyter Notebook

I'm currently learning basic web scraping using python and beautiful soup. I did some stuff in my Jupyter Notebook and it worked, but when I run the same code from a .py file in my terminal, BeautifulSoup does not seem to be parsing correctly, and nothing get printed out.

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome(executable_path="/Users/Shiva/Downloads/chromedriver")

driver.get('https://www.google.com/flights?hl=en#flt=/m/03v_5.IAD.2019-02-10*IAD./m/03v_5.2019-02-11;c:USD;e:1;sd:1;t:f')

load_all_flights = driver.find_element_by_xpath('//*[@id="flt-app"]/div[2]/main[4]/div[7]/div[1]/div[3]/div[4]/div[5]/div[1]/div[3]/jsl/a[1]/span[1]/span[2]')

load_all_flights.click()

soup = BeautifulSoup(driver.page_source, 'html.parser')

info = soup.find_all('div', class_="gws-flights-results__collapsed-itinerary gws-flights-results__itinerary")

for trip in info:
    price = trip.find('div', class_="flt-subhead1 gws-flights-results__price gws-flights-results__cheapest-price")
    if price == None:
        price = trip.find('div', class_="flt-subhead1 gws-flights-results__price")
    type_of_flight = trip.find('div', class_="gws-flights-results__stops flt-subhead1Normal gws-flights-results__has-warning-icon")
    if type_of_flight == None:
        type_of_flight = trip.find('div', class_="gws-flights-results__stops flt-subhead1Normal")
    print(str(type_of_flight.text).strip()  + " : " + str(price.text).strip())

In jupyter note book, I get a list of flight types and prices "nonstop: $500"

but it doesn't work in terminal as the "info" variable is an empty list

You need to wait for the page to render. The reason Jupyter gets data is that it is slow enough (or you have different cells) to render the page before you parse the page. The following should do the trick:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome(executable_path="C:\\Users\\Andersen\\Desktop\\Tools\\chromedriver.exe")

driver.get('https://www.google.com/flights?hl=en#flt=/m/03v_5.IAD.2019-02-10*IAD./m/03v_5.2019-02-11;c:USD;e:1;sd:1;t:f')

xpath = '//*[@id="flt-app"]/div[2]/main[4]/div[7]/div[1]/div[3]/div[4]/div[5]/div[1]/div[3]/jsl/a[1]/span[1]/span[2]'

wait = WebDriverWait(driver, 10)
confirm = wait.until(EC.element_to_be_clickable((By.XPATH, xpath)))

load_all_flights = driver.find_element_by_xpath(xpath)

load_all_flights.click()

soup = BeautifulSoup(driver.page_source, 'html.parser')

info = soup.find_all('div', class_="gws-flights-results__collapsed-itinerary gws-flights-results__itinerary")

for trip in info:
    price = trip.find('div', class_="flt-subhead1 gws-flights-results__price gws-flights-results__cheapest-price")
    if price == None:
        price = trip.find('div', class_="flt-subhead1 gws-flights-results__price")
    type_of_flight = trip.find('div', class_="gws-flights-results__stops flt-subhead1Normal gws-flights-results__has-warning-icon")
    if type_of_flight == None:
        type_of_flight = trip.find('div', class_="gws-flights-results__stops flt-subhead1Normal")
    print(str(type_of_flight.text).strip()  + " : " + str(price.text).strip())

Output (as of 2019-02-02):

2 stops : $588
Nonstop : $749
Nonstop : $749
1 stop : $866
2 stops : $1,271
2 stops : $1,294
2 stops : $1,294
2 stops : $1,805
2 stops : $1,805

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM