
Python BeautifulSoup & Selenium not scraping full HTML

Beginner web-scraper here. My practice task is simple: collect and count a player's Pokemon usage over their last 50 games, on this page for example. To do this, I planned to use the image URL of each Pokemon, which contains the Pokemon's name (in an <img> tag wrapped in <span></span>). Inspecting in Chrome shows this: <img alt="Played pokemon" srcset="/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&amp;w=96&amp;q=75 1x, /_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&amp;w=256&amp;q=75 2x" ...
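Since the sprite URL embeds the Pokemon's name, extracting it is a small parsing exercise. A minimal sketch (the helper name pokemon_name and the t_Square_ prefix convention are taken from the sample URL above, not from any documented API):

```python
from urllib.parse import urlparse, parse_qs

def pokemon_name(img_url):
    # The sprite path (e.g. "/Sprites/t_Square_Snorlax.png") is passed to
    # Next.js's /_next/image endpoint in the `url` query parameter;
    # parse_qs decodes the percent-encoding for us.
    sprite_path = parse_qs(urlparse(img_url).query)["url"][0]
    filename = sprite_path.rsplit("/", 1)[-1]       # "t_Square_Snorlax.png"
    return filename[len("t_Square_"):-len(".png")]  # "Snorlax"

print(pokemon_name("/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=96&q=75"))  # Snorlax
```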

1) Using BeautifulSoup alone doesn't return the HTML for the images I need:

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://uniteapi.dev/p/%E3%81%BB%E3%81%B0%E3%81%A1')
wp_player = bs(r.content, 'html.parser')
wp_player.select('span img')  # the sprite <img> tags are not in the response

2) Using Selenium picks up some of what BeautifulSoup missed:

from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

url = "https://uniteapi.dev/p/%E3%81%BB%E3%81%B0%E3%81%A1"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)
driver.get(url)
page = driver.page_source
driver.quit()

soup = bs(page, 'html.parser')
soup.select('span img')

But it gives me image tags whose src looks like this: <img alt="Played pokemon" data-nimg="fixed" decoding="async" src="data:image/gif;base64,R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
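As an aside, that data: URI is not a broken link; decoding the base64 payload shows it is a tiny transparent GIF, the kind of lazy-loading placeholder that the real sprite URL replaces once JavaScript runs. A quick sketch to confirm:

```python
import base64

# The base64 payload from the src attribute above
placeholder = "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
data = base64.b64decode(placeholder)

# GIF header: 6-byte signature, then width and height as little-endian uint16
print(data[:6])                                  # b'GIF89a'
width = int.from_bytes(data[6:8], "little")
height = int.from_bytes(data[8:10], "little")
print(width, height)                             # 1 1
```

So the Selenium snippet is grabbing the page before the placeholders have been swapped for real images.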

What am I misunderstanding here? The website I'm interested in does not have a public API, despite its name. Any help is much appreciated.

This is a common issue when scraping websites before they have loaded completely. What you have to do is wait for the page to finish loading the images you need. You have two options: an implicit wait, or an explicit wait for the image elements to load.

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

url = r"https://uniteapi.dev/p/%E3%81%BB%E3%81%B0%E3%81%A1"
options = Options()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(options=options)  # Selenium 4 locates chromedriver itself; executable_path is removed
driver.get(url)
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '[alt="Played pokemon"]'))) # EXPLICIT WAIT
driver.implicitly_wait(10) # IMPLICIT WAIT

pokemons = driver.find_elements(By.CSS_SELECTOR, '[alt="Played pokemon"]')  # find_elements_by_* was removed in Selenium 4
for element in pokemons:
    print(element.get_attribute("src"))

You have to choose one or the other, but it is better to explicitly wait for the element(s) to be rendered before you try to access their values.

OUTPUT:
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Tsareena.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Tsareena.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75
https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75
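With the real src values in hand, the finishing step of the original task (counting usage) is simple post-processing. A minimal sketch using the URLs from the output above (the sample list is abbreviated, and name_from_src is an illustrative helper based on the t_Square_ naming pattern):

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

# src values as collected by the Selenium loop above (shortened sample)
srcs = [
    "https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Tsareena.png&w=256&q=75",
    "https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Tsareena.png&w=256&q=75",
    "https://uniteapi.dev/_next/image?url=%2FSprites%2Ft_Square_Snorlax.png&w=256&q=75",
]

def name_from_src(src):
    # Pull "/Sprites/t_Square_Snorlax.png" out of the `url` query parameter,
    # then strip the sprite prefix and the file extension (Python 3.9+)
    sprite = parse_qs(urlparse(src).query)["url"][0]
    return sprite.rsplit("/", 1)[-1].removeprefix("t_Square_").removesuffix(".png")

usage = Counter(name_from_src(s) for s in srcs)
print(usage.most_common())  # [('Tsareena', 2), ('Snorlax', 1)]
```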

Your workaround wasn't working because a plain GET request returns the page's HTML in its initial state, before the DOM elements have been rendered.

The reason is that this site uses what is called Ajax to load the Pokémon dynamically via JavaScript.

One thing you can do is observe the Network tab in the browser's developer tools, look for the request that carries the data, and see whether you can call that URL directly to get the data you are looking for.

A lot of the time when web scraping, you can do this, and it will return the data in a more convenient, serialized format (often JSON).

Otherwise you can do as Sac's answer suggests and wait for the data to load fully, either by checking whether an element has loaded yet or by hard-coding a sleep call, which is less clean.

Although not an answer: we have specifically put anti-scraping measures into our code. I would appreciate it if you did not try to scrape our website and instead talked to us on our Discord.
