
Beautiful Soup and the findAll() process

I am attempting to scrape data from a site using the following code. The site required the decode method, and I followed a solution from @royatirek. My problem is that container_a ends up containing nothing. I use a similar method on a few other sites and it works, but on this and a couple of other sites my container_a variable remains an empty list. Cheers

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup as soup

my_url = ('http://www.news.com.au/sport/afl-round-3-teams-full-lineups-and-'
          'the-best-supercoach-advice/news-story/dfbe9e0e68d445e07c9522a138a2b824')
req = Request(my_url, headers={'User-Agent': 'Mozilla/5.0'})
web_byte = urlopen(req).read()
webpage = web_byte.decode('utf-8')
page_soup = soup(webpage, "html.parser")
container_a = page_soup.findAll("div", {"class": "fyre-comment-wrapper"})
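One way to see why container_a comes back empty: findAll returns an empty list whenever the requested class is simply not present in the markup Beautiful Soup was given, which is exactly what happens when the element is injected later by JavaScript. A minimal sketch with a stand-in static HTML string (the class name is the one from the question):

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML as originally served: the comment markup is
# absent because the real site injects it later with JavaScript.
static_html = "<html><body><div class='story'>article text</div></body></html>"

page_soup = BeautifulSoup(static_html, "html.parser")

# Searching for a class that only exists after the JS runs yields [].
container_a = page_soup.findAll("div", {"class": "fyre-comment-wrapper"})
print(container_a)  # → []
```

So the scraping code itself is fine; the bytes it downloads just never contain the comment wrappers.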

The content you want to parse is being loaded dynamically by JavaScript, so a plain HTTP request won't do the job for you. You could use selenium with ChromeDriver (or any other driver) for that:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("http://www.news.com.au/sport/afl-round-3-teams-full-lineups-and-the-best-supercoach-advice/news-story/dfbe9e0e68d445e07c9522a138a2b824")

You can then proceed with bs4 as you wish by accessing the page source via .page_source:

page_soup = BeautifulSoup(driver.page_source, "html.parser")
container_a = page_soup.findAll("div",{"class":"fyre-comment-wrapper"})
