繁体   English   中英

BeautifulSoup 无法在终端上正确解析 html,但可以在我的 Jupyter Notebook 中使用

[英]BeautifulSoup is not parsing html correctly on terminal, but works in my Jupyter Notebook

我目前正在使用 python 和漂亮的汤学习基本的网络抓取。 我在我的 Jupyter Notebook 中做了一些事情并且它起作用了,但是当我从终端中的 .py 文件运行相同的代码时,BeautifulSoup 似乎没有正确解析,并且没有打印出来。

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome(executable_path="/Users/Shiva/Downloads/chromedriver")

driver.get('https://www.google.com/flights?hl=en#flt=/m/03v_5.IAD.2019-02-10*IAD./m/03v_5.2019-02-11;c:USD;e:1;sd:1;t:f')

load_all_flights = driver.find_element_by_xpath('//*[@id="flt-app"]/div[2]/main[4]/div[7]/div[1]/div[3]/div[4]/div[5]/div[1]/div[3]/jsl/a[1]/span[1]/span[2]')

load_all_flights.click()

soup = BeautifulSoup(driver.page_source, 'html.parser')

info = soup.find_all('div', class_="gws-flights-results__collapsed-itinerary gws-flights-results__itinerary")

for trip in info:
    price = trip.find('div', class_="flt-subhead1 gws-flights-results__price gws-flights-results__cheapest-price")
    if price == None:
        price = trip.find('div', class_="flt-subhead1 gws-flights-results__price")
    type_of_flight = trip.find('div', class_="gws-flights-results__stops flt-subhead1Normal gws-flights-results__has-warning-icon")
    if type_of_flight == None:
        type_of_flight = trip.find('div', class_="gws-flights-results__stops flt-subhead1Normal")
    print(str(type_of_flight.text).strip()  + " : " + str(price.text).strip())

在 jupyter 笔记本中,我得到了一个航班类型和价格的列表“不间断:500 美元”

但它在终端中不起作用,因为“信息”变量是一个空列表

您需要等待页面呈现。 Jupyter 获取数据的原因是它足够慢(或者您有不同的单元格)在解析页面之前呈现页面。 以下应该可以解决问题:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pandas as pd

driver = webdriver.Chrome(executable_path="C:\\Users\\Andersen\\Desktop\\Tools\\chromedriver.exe")

driver.get('https://www.google.com/flights?hl=en#flt=/m/03v_5.IAD.2019-02-10*IAD./m/03v_5.2019-02-11;c:USD;e:1;sd:1;t:f')

xpath = '//*[@id="flt-app"]/div[2]/main[4]/div[7]/div[1]/div[3]/div[4]/div[5]/div[1]/div[3]/jsl/a[1]/span[1]/span[2]'

wait = WebDriverWait(driver, 10)
confirm = wait.until(EC.element_to_be_clickable((By.XPATH, xpath)))

load_all_flights = driver.find_element_by_xpath(xpath)

load_all_flights.click()

soup = BeautifulSoup(driver.page_source, 'html.parser')

info = soup.find_all('div', class_="gws-flights-results__collapsed-itinerary gws-flights-results__itinerary")

for trip in info:
    price = trip.find('div', class_="flt-subhead1 gws-flights-results__price gws-flights-results__cheapest-price")
    if price == None:
        price = trip.find('div', class_="flt-subhead1 gws-flights-results__price")
    type_of_flight = trip.find('div', class_="gws-flights-results__stops flt-subhead1Normal gws-flights-results__has-warning-icon")
    if type_of_flight == None:
        type_of_flight = trip.find('div', class_="gws-flights-results__stops flt-subhead1Normal")
    print(str(type_of_flight.text).strip()  + " : " + str(price.text).strip())

输出(截至 2019-02-02):

2 stops : $588
Nonstop : $749
Nonstop : $749
1 stop : $866
2 stops : $1,271
2 stops : $1,294
2 stops : $1,294
2 stops : $1,805
2 stops : $1,805

暂无
暂无

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM