BeautifulSoup4 cannot find the expected element

Question

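I am using requests and bs4 to extract the first result preview from http://duckduckgo.com/?q=who+is+harry+potter

However, when I use bs4's find method to look for the div with the class "result__snippet", it returns None. Yet when I save the whole web page to my hard drive, open it directly, and parse it with bs4, soup.find('div', class_='result__snippet').get_text() returns the desired output.

Any help?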

Answer 1

The site you linked to appears to use JavaScript to build its search results, so the page you retrieve and pass to BeautifulSoup does not yet contain them.
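To make the difference concrete, here is a small offline sketch. Both HTML strings are invented stand-ins, not DuckDuckGo's real markup: the first mimics what requests receives before any script has run, the second mimics the DOM a browser ends up with (which is what gets written to disk when you save the page).

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw response: results are injected later by a script
raw_html = """
<html><body>
  <div id="links"></div>
  <script>/* builds the result list in the browser */</script>
  <noscript>JavaScript is disabled; use the HTML version instead.</noscript>
</body></html>
"""

# Hypothetical stand-in for the page after the browser has run the script
rendered_html = """
<html><body>
  <div id="links">
    <div class="result__snippet">Harry Potter is a series of fantasy novels ...</div>
  </div>
</body></html>
"""

raw_hit = BeautifulSoup(raw_html, "html.parser").find("div", class_="result__snippet")
rendered_hit = BeautifulSoup(rendered_html, "html.parser").find("div", class_="result__snippet")

print(raw_hit)                  # None: the element is not in the raw markup
print(rendered_hit.get_text())  # snippet text from the rendered markup
```

BeautifulSoup only parses the markup it is given and never executes scripts, which would explain why the same find call fails on the downloaded page but succeeds on the saved one.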

If you look at the content of the retrieved page (print(soup.text)), you can see that it tells you to use http://duckduckgo.com/html/?q=who+is+harry+potter if you do not have JavaScript enabled.

Scraping that URL should give you what you need.
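A minimal sketch of that approach with requests and bs4 follows. The helper names first_snippet and fetch_results are made up for illustration, and two things are assumptions rather than facts from the page itself: that the HTML version still uses the result__snippet class (the lookup matches on the class only, since the tag may differ), and that a browser-like User-Agent header is needed.

```python
import requests
from bs4 import BeautifulSoup

HTML_URL = "http://duckduckgo.com/html/"


def first_snippet(html):
    """Return the text of the first element with class 'result__snippet', or None."""
    soup = BeautifulSoup(html, "html.parser")
    # Match on the class only: the tag name used by the HTML version is an assumption
    hit = soup.find(class_="result__snippet")
    return hit.get_text().strip() if hit else None


def fetch_results(query):
    """Download the JavaScript-free results page for a query."""
    # Assumption: a browser-like User-Agent avoids the default client being rejected
    headers = {"User-Agent": "Mozilla/5.0"}
    response = requests.get(HTML_URL, params={"q": query}, headers=headers, timeout=10)
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# print(first_snippet(fetch_results("who is harry potter")))
```

Keeping the parsing step separate from the download makes it possible to check the parser against a saved copy of the page without hitting the network each time.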

Answer 2

One approach is to combine Selenium with BeautifulSoup. Try this; it works.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup as bs
from fake_useragent import UserAgent

url = 'https://duckduckgo.com/?q=who+is+harry+potter&ia=web'
delay = 10  # seconds to wait for the results (was undefined in the original snippet)

# Override the User-Agent with a random one via Firefox options
options = webdriver.FirefoxOptions()
options.set_preference('general.useragent.override', str(UserAgent().random))
driver = webdriver.Firefox(options=options)
driver.get(url)
try:
    # Block until JavaScript has rendered at least one result snippet
    WebDriverWait(driver, delay).until(EC.presence_of_element_located((By.CLASS_NAME, 'result__snippet')))
    print('Page is ready!')
except TimeoutException:
    # Give up after one timeout instead of retrying forever
    print('Loading took too much time!')
# Grab the rendered DOM, which includes the script-generated results
html = driver.execute_script('return document.body.innerHTML')
driver.quit()

b_html = bs(html, 'html.parser')
snippet = b_html.find('div', class_='result__snippet')
print(snippet.get_text() if snippet else 'No snippet found')

Output:

Harry Potter is a series of fantasy novels written by British author J. K. Rowling. The novels chronicle the life of a young wizard, Harry Potter, ...

Answer 1: accepted, 2018-07-21 11:02:53

Answer 2: 2018-07-21 11:18:23
