Scraping website with canvas elements using Python Selenium

I could really use some help with scraping the data from the line or donut charts on this website. I need this data for a study project focusing on forecasting solar and wind production in the Netherlands.

I'd like to use Python for the task, and I have attempted to do so using Selenium.

The data is displayed in canvas elements, which makes this more challenging than expected, and I could use some help figuring out the right approach to extract it. Any help would be much appreciated.

My approach so far has been to locate the line-chart element and then 'move the mouse' over the chart from left to right (using Selenium's ActionChains and the move_to_element_with_offset function).

For each step, I'd record the data shown in the hover text and somehow link it to the right timestamp.
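The left-to-right sweep described above could be sketched roughly as follows. This is only a sketch under assumptions the question does not confirm: that the time axis runs linearly across the full width of the element and covers exactly one day, and that the chart has no side margins. `hover_offsets` and `offset_to_timestamp` are hypothetical helper names, not part of Selenium.

```python
from datetime import datetime, timedelta


def hover_offsets(width_px, steps):
    """Return `steps` x offsets (in pixels) spread evenly from left to right."""
    if steps < 2:
        raise ValueError("need at least two steps")
    return [round(i * (width_px - 1) / (steps - 1)) for i in range(steps)]


def offset_to_timestamp(x_px, width_px, day_start, span=timedelta(days=1)):
    """Map an x offset to a timestamp, ASSUMING a linear time axis over `span`."""
    fraction = x_px / (width_px - 1)
    return day_start + fraction * span


# Example: a 744 px wide chart (the canvas width shown in the page source),
# sampled at 5 hover positions for 2022-05-05.
width = 744
start = datetime(2022, 5, 5)
for x in hover_offsets(width, 5):
    print(x, offset_to_timestamp(x, width, start))
```

Each offset would then be passed to move_to_element_with_offset before reading the hover text. Note that the offsets here are measured from the left edge of the element; depending on the Selenium version, move_to_element_with_offset may measure from the top-left corner or from the centre of the element, so a conversion may be needed.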

See here for a screenshot of how it looks in my browser. Note how the Zonne energie data value appears in the div below when hovering:

[Screenshot: how it looks in the browser]

The problem, however, is that the data does not show up in the page source, probably because I have not been able to figure out how to hover the mouse over the chart using Selenium.

My initial code is:

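import pathlib

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

chrome_driver_path = pathlib.Path(__file__).parent / "chromedriver"
options = webdriver.ChromeOptions()
options.add_argument("--headless")
driver = webdriver.Chrome(service=Service(str(chrome_driver_path)), options=options)
url = "https://energieopwek.nl"
driver.get(url)

line_chart = driver.find_element(By.ID, "linechart_1")
action = ActionChains(driver)  # action chain used for mouse input
action.move_to_element(line_chart).click().perform()  # clicking on the chart
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.prettify())  # I'd expect to see the data in the page source, but it's not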

Here is the page source output. I'd have expected the data from the chart to be present in the divs, as in the screenshot above:

 <div _echarts_instance_="ec_1652165210746" class="eo-chart" id="linechart_1" style="-webkit-tap-highlight-color: transparent; user-select: none; position: relative; background: rgba(0, 0, 0, 0);"> <div style="position: relative; overflow: hidden; width: 744px; height: 385px; padding: 0px; margin: 0px; border-width: 0px; cursor: default;"> <canvas data-zr-dom-id="zr_0" height="385" style="position: absolute; left: 0px; top: 0px; width: 744px; height: 385px; user-select: none; -webkit-tap-highlight-color: rgba(0, 0, 0, 0); padding: 0px; margin: 0px; border-width: 0px;" width="744"> </canvas> </div> <div> --- WHERE IS THE DATA?--- </div> </div>

Curious to hear if anybody is able to help me here.

If this is for a project you are going to publish, you should reach out to the source and ask for permission, or get lawyers involved to make sure you are not breaking the terms of service of that site. I get the feeling they might have obfuscated the data to prevent what you are trying to do.


About my comment and the data available at:
https://energieopwek.nl/data.php?sid=2ecde3&Day=2022-05-05&scale=day
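If that endpoint is the route taken, the URL for each day could be built along these lines. This is a minimal sketch that only mirrors the query parameters visible in the URL above; whether the `sid` value stays valid across sessions, and what format the response has, is not stated in this thread, so the sketch stops at building URLs. `data_url` and `data_urls` are hypothetical helper names.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

BASE_URL = "https://energieopwek.nl/data.php"


def data_url(day, sid="2ecde3", scale="day"):
    """Build the data.php URL for one day, mirroring the URL quoted above.

    The default `sid` is copied from that URL; it may be session-specific.
    """
    query = urlencode({"sid": sid, "Day": day.isoformat(), "scale": scale})
    return f"{BASE_URL}?{query}"


def data_urls(first_day, n_days):
    """URLs for `n_days` consecutive days starting at `first_day`."""
    return [data_url(first_day + timedelta(days=i)) for i in range(n_days)]


for url in data_urls(date(2022, 5, 5), 3):
    print(url)
```

The resulting URLs could be opened in the browser or fetched with an HTTP client to inspect what the server actually returns before writing any parsing code.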

Even with the JS code uglified, we can still make some of it out:
... that return seriesData caught my eye; it looks like that is the raw data for the chart.

If you know how to use the debugger in the developer console, that is your starting point.

And it looks like there is a way to read JS variables from Selenium, if that is what you prefer using:
Reading JavaScript variables using Selenium WebDriver
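As a rough illustration of reading chart data through JavaScript, something like the following might work. It relies on several unconfirmed assumptions: the `_echarts_instance_` attribute in the page source suggests the chart is drawn with ECharts, and the sketch assumes the page exposes a global `echarts` object, which may not be the case. `read_chart_series` is a hypothetical helper name; `execute_script`, `echarts.getInstanceByDom`, and `getOption` are existing Selenium and ECharts calls.

```python
# JavaScript executed inside the page. ASSUMPTION: a global `echarts` object
# exists; if it does not, the script returns null instead of raising.
READ_SERIES_JS = """
const dom = document.getElementById(arguments[0]);
if (!dom || typeof echarts === 'undefined') { return null; }
const chart = echarts.getInstanceByDom(dom);
if (!chart) { return null; }
// Return only plain, serialisable fields of each series.
return chart.getOption().series.map(s => ({name: s.name, data: s.data}));
"""


def read_chart_series(driver, element_id="linechart_1"):
    """Return a list of {name, data} dicts for the chart, or None if unavailable."""
    return driver.execute_script(READ_SERIES_JS, element_id)


# Usage with a real Selenium driver, after driver.get(url):
#     series = read_chart_series(driver)
```

Passing the element id as a script argument keeps the JavaScript reusable for the other charts on the page, assuming they follow the same pattern.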


You can take a screenshot with Selenium and then crop it automatically. Here's an example of something similar that I've done before.

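from PIL import Image
from selenium.webdriver.common.by import By

# Locate the element and read its position and size on the page
element = driver.find_element(By.XPATH, '//*[@id="THIS_WEEK"]')
location = element.location
size = element.size
driver.save_screenshot("due.png")  # screenshot of the current window
x = location['x']
y = location['y']
w = size['width']
h = size['height']
right = x + w
bottom = y + h
# Crop the screenshot to the element's bounding box and overwrite the file
im = Image.open('due.png')
im = im.crop((int(x), int(y), int(right), int(bottom)))
im.save('due.png')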
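One thing that can go wrong with this kind of crop is a mismatch between CSS pixels and screenshot pixels on high-density displays. The sketch below computes the crop box with an optional scale factor; `crop_box` and its `scale` parameter are hypothetical, and whether scaling is needed at all depends on the browser setup.

```python
def crop_box(location, size, scale=1.0):
    """Return (left, top, right, bottom) in screenshot pixels.

    `location` and `size` are dicts like those returned by a Selenium element;
    `scale` is an ASSUMED device pixel ratio (1.0 means no scaling).
    """
    left = int(location["x"] * scale)
    top = int(location["y"] * scale)
    right = int((location["x"] + size["width"]) * scale)
    bottom = int((location["y"] + size["height"]) * scale)
    return (left, top, right, bottom)


# Example with the canvas dimensions from the page source and a made-up position.
print(crop_box({"x": 10, "y": 20}, {"width": 744, "height": 385}))
```

The returned tuple can be passed directly to Image.crop. If scaling turns out to be necessary, the ratio could be read from the page with driver.execute_script("return window.devicePixelRatio").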
