Python BeautifulSoup & Selenium not scraping full html

Beginner web-scraper here. My practice task is simple: collect/count a player's Pokemon usage over their last 50 games, on this page for example. To do this, I planned to use the image URL of each Pokemon, which contains the Pokemon's name (in an <img> tag, wrapped in <span></span>). Inspecting in Chrome looks like this: <img alt="Played pokemon" srcset="/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&amp;w=96&amp;q=75 1x, /_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&amp;w=256&amp;q=75 2x" ...

1) Using Beautiful Soup alone doesn't get the HTML of the images that I need:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://uniteapi.dev/p/%E3%81%BB%E3%81%B0%E3%81%A1')
wp_player = bs(r.content, 'html.parser')  # specify a parser explicitly
print(wp_player.select('span img'))  # the <img> tags I need are missing from the static HTML

2) Using Selenium picks up some of what BeautifulSoup missed:

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://uniteapi.dev/p/%E3%81%BB%E3%81%B0%E3%81%A1"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
page = driver.page_source  # captured as soon as the page loads, before the JS has rendered the images
driver.quit()

soup = bs(page, 'html.parser')
print(soup.select('span img'))

But it gives me links that look like this: <img alt="Played pokemon" data-nimg="fixed" decoding="async" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7">

What am I misunderstanding here? The website I'm interested in does not have a public API, despite its name. Any help is much appreciated.

This is a common issue when scraping websites before they have loaded completely. What you have to do is wait for the page to finish loading the images you need. You have two options: an implicit wait or an explicit wait for the image elements to load.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

url = "https://uniteapi.dev/p/%E3%81%BB%E3%81%B0%E3%81%A1"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)

# EXPLICIT WAIT: block until every matching <img> is present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[alt="Played pokemon"]'))
)
# IMPLICIT WAIT: alternatively, poll up to 10 s on every element lookup (pick one or the other)
driver.implicitly_wait(10)

pokemons = driver.find_elements(By.CSS_SELECTOR, '[alt="Played pokemon"]')
for element in pokemons:
    print(element.get_attribute("src"))

You have to choose one or the other, but it's better to explicitly wait for the element(s) to be rendered before you try to access their values.

OUTPUT:
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Tsareena.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Tsareena.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75

Your workaround wasn't working because a plain GET request returns the page's HTML in its initial state, before the DOM elements have been rendered by the page's JavaScript.
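
From there, counting usage is just a matter of parsing the name out of each URL. A minimal sketch, assuming the t_Square_<Name>.png naming pattern seen in the output above and the pokemons list from the code above:

import re
from collections import Counter

# Parse "t_Square_<Name>.png" out of each image URL (pattern inferred from the output above)
urls = [element.get_attribute("src") for element in pokemons]
names = [m.group(1) for m in (re.search(r"t_Square_(\w+)\.png", u) for u in urls) if m]
print(Counter(names))  # e.g. Counter({'Snorlax': 5, 'Tsareena': 2})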

The reason is that this site uses what is called Ajax to load the Pokémon dynamically via JavaScript.

One thing you can do is observe the Network tab in the browser's debugger, look for the URL that returns the data you want, and check whether you can call that URL directly.

A lot of the time when web scraping, you can do this and the data comes back in a more serialized format, as in the sketch below.
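
For instance, if the Network tab shows the page fetching its data from a JSON endpoint, something along these lines might work. This is only a sketch: the endpoint URL here is hypothetical and has to be replaced with whatever the debugger actually shows.

import requests

# Hypothetical endpoint -- copy the real one from the Network tab in your browser's dev tools
api_url = 'https://uniteapi.dev/api/player/some-player'

r = requests.get(api_url, headers={'User-Agent': 'Mozilla/5.0'})
r.raise_for_status()
data = r.json()  # already-serialized data, no HTML parsing or browser needed
print(data)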

Otherwise you can do as mentioned in Sac's answer and just wait for the data to fully load, either by checking whether an element has loaded yet or by hard-coding a sleep call, which is less clean.
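
For completeness, the hard-coded sleep variant looks like this; a crude sketch, with the 10 seconds being an arbitrary guess rather than anything the site guarantees:

import time

driver.get(url)
time.sleep(10)  # brittle: waits the full 10 s every time, and still fails if rendering is slower
page = driver.page_source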

Although not an answer: we have specifically put anti-scraping measures into our code. I would appreciate it if you did not try to scrape our website and instead talked to us on our Discord.
