
HTML acquired in Python code is not the same as displayed webpage

I have recently started learning web scraping with Scrapy and, as a practice exercise, I decided to scrape a weather data table from this URL: https://www.wunderground.com/history/monthly/OIII/date/2000-5

By inspecting the table element of the page, I copied its XPath into my code, but I only got an empty list when running it. I tried to check which tables are present in the HTML using this code:

from scrapy import Selector
import requests
import pandas as pd

url = 'https://www.wunderground.com/history/monthly/OIII/date/2000-5'
html = requests.get(url).text  # .text returns a decoded str, which Selector expects (.content returns bytes)

sel = Selector(text=html)
table = sel.xpath('//table')

It only returns one table and it is not the one I wanted.

After some research, I found out that it might have something to do with JavaScript rendering in the page source, and that Python's requests library can't execute JavaScript.

After going through a number of SO Q&As, I came across the requests-html library, which can apparently handle JS execution, so I tried acquiring the table with this snippet:

from requests_html import HTMLSession
from scrapy import Selector

session = HTMLSession()
resp = session.get('https://www.wunderground.com/history/monthly/OIII/date/2000-5')
resp.html.render()
html = resp.html.html

sel = Selector(text=html)
tables = sel.xpath('//table')

print(tables)

But the result doesn't change. How can I acquire that table?

Problem

Multiple problems may be at play here: not only JavaScript execution, but also HTML5 APIs, cookies, the user agent, and so on.

Solution

Consider using Selenium with a headless Chrome or Firefox web driver. Driving a real browser ensures the page is loaded exactly as intended, JavaScript included. Headless mode lets you run your code without spawning a GUI browser; you can, of course, disable headless mode to watch what happens to the page in real time, and even add a breakpoint so that you can debug in the browser's console, beyond what pdb offers.

Example Code:


from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--headless")

driver = webdriver.Chrome(options=chrome_options)
driver.get("https://www.wunderground.com/history/monthly/OIII/date/2000-5")

# Selenium 4 removed find_elements_by_xpath; use find_elements(By.XPATH, ...)
# There are several locator strategies available besides XPath.
tables = driver.find_elements(By.XPATH, '//table')

print(tables)
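Once the page has rendered, the table markup can be handed to pandas (which the question already imports) to get a DataFrame in one step. A minimal sketch of that last step; a stand-in HTML snippet is used here in place of `driver.page_source` so the example is self-contained:

```python
from io import StringIO

import pandas as pd

# In the real script this would be: html = driver.page_source
# (stand-in table so the example runs without a browser)
html = """
<table>
  <tr><th>Date</th><th>Max Temp</th><th>Min Temp</th></tr>
  <tr><td>2000-05-01</td><td>28</td><td>14</td></tr>
  <tr><td>2000-05-02</td><td>30</td><td>15</td></tr>
</table>
"""

# pandas.read_html parses every <table> in the document into a DataFrame
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```

If the rendered page contains several tables, `pd.read_html` returns them all, so you may need to pick the right index or use its `match` parameter to select the one you want.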

References

Selenium Github: https://github.com/SeleniumHQ/selenium

Selenium (Python) Documentation: https://selenium-python.readthedocs.io/getting-started.html

Locating Elements: https://selenium-python.readthedocs.io/locating-elements.html

You can use the scrapy-splash plugin to make Scrapy work with Splash (Scrapinghub's JavaScript rendering browser).

Using Splash you can render JavaScript and also trigger user events such as mouse clicks.
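As a sketch, wiring Splash into a Scrapy project is mostly configuration. Assuming a Splash instance is running locally on its default port 8050, the project settings would look roughly like this (a settings.py fragment; adjust the URL to your setup):

```python
# settings.py (sketch; assumes a local Splash instance on port 8050)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
```

In the spider you then issue `scrapy_splash.SplashRequest` instead of `scrapy.Request`, typically with `args={'wait': 2}` or similar to give the page's JavaScript time to render before the response is returned.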
