BeautifulSoup4 doesn't find elements properly

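I am using requests and bs4 to extract the first preview snippet from the link http://duckduckgo.com/?q=who+is+harry+potter

However, when I use bs4's find method to look for the div with the class "result__snippet", it returns None. Yet when I save the whole web page to my hard disk, open that file directly, and parse it with bs4, soup.find('div', class_='result__snippet').get_text() returns the desired output.

Any help?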

The site you linked to appears to use JavaScript to build its search results, so the page you retrieve and pass to BeautifulSoup does not yet contain them.
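As a rough illustration of why find returns None, the sketch below parses two hand-written pages: one that mimics the script-driven shell a plain HTTP client receives, and one that mimics the page after a browser has run the script. Both HTML strings and the find_snippet helper are hypothetical stand-ins, not DuckDuckGo's real markup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for what requests receives: a shell page whose
# results are filled in later by a script.
SHELL_HTML = """
<html><body>
  <noscript>Please use the HTML version of this site.</noscript>
  <div id="links"></div>
  <script src="app.js"></script>
</body></html>
"""

# Hypothetical stand-in for the same page after a browser has run the script.
RENDERED_HTML = """
<html><body>
  <div id="links">
    <div class="result__snippet">Harry Potter is a series of fantasy novels.</div>
  </div>
</body></html>
"""


def find_snippet(html):
    """Return the text of the first result__snippet div, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    node = soup.find("div", class_="result__snippet")
    return node.get_text(strip=True) if node else None


print(find_snippet(SHELL_HTML))     # None: the results div is still empty
print(find_snippet(RENDERED_HTML))  # the snippet text
```

The saved copy on disk behaves like the second page because the browser wrote out the DOM after the script had already run.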

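If you look at the content of the retrieved page (print(soup.text)), you can see that it tells you to use http://duckduckgo.com/html/?q=who+is+harry+potter if you do not have JavaScript enabled.

Crawling that URL should give you what you need.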
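A minimal sketch of that approach, assuming the no-JavaScript page still marks each preview with the result__snippet class. The helper names, the User-Agent header, and the timeout are illustrative choices, not part of the original answer. The lookup matches on the class alone, since the tag that carries it on the HTML version may not be a div.

```python
import requests
from bs4 import BeautifulSoup

# No-JavaScript endpoint mentioned in the page's own fallback notice.
HTML_ENDPOINT = "http://duckduckgo.com/html/"


def first_snippet(html):
    """Return the text of the first element with class result__snippet, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Match on the class only (assumption: the tag name may differ from div).
    node = soup.find(class_="result__snippet")
    return node.get_text().strip() if node else None


def fetch_results_page(query, timeout=10):
    """Download the no-JavaScript results page for a query (hypothetical helper)."""
    # Assumption: a browser-like User-Agent makes a blocked request less likely.
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(
        HTML_ENDPOINT, params={"q": query}, headers=headers, timeout=timeout
    )
    response.raise_for_status()
    return response.text
```

Calling first_snippet(fetch_results_page("who is harry potter")) would then return the first preview, or None if the markup differs from what is assumed here.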

One approach is to use Selenium together with BeautifulSoup. Try the following; it worked for me.

from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup as bs
from fake_useragent import UserAgent

url = 'https://duckduckgo.com/?q=who+is+harry+potter&ia=web'
delay = 10  # seconds to wait for the results to render (assumed value)

# Override the browser's user agent with a random one from fake_useragent.
options = Options()
options.set_preference('general.useragent.override', UserAgent().random)
driver = webdriver.Firefox(options=options)
try:
    driver.get(url)
    try:
        # Wait until JavaScript has rendered at least one result snippet.
        WebDriverWait(driver, delay).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'result__snippet')))
        print('Page is ready!')
    except TimeoutException:
        # Give up after one wait instead of retrying forever.
        print('Loading took too much time!')
    html = driver.execute_script('return document.body.innerHTML')
finally:
    # Always shut the browser down, even if something above failed.
    driver.quit()

# Parse the rendered HTML and take the first snippet, if there is one.
b_html = bs(html, 'html.parser')
snippet = b_html.find('div', class_='result__snippet')
x = snippet.get_text() if snippet else None
print(x)

Output:

Harry Potter is a series of fantasy novels written by British author J. K. Rowling. The novels chronicle the life of a young wizard, Harry Potter, ...
