
Using R to scrape data from a table possibly populated with JavaScript

Original post: 2019-03-06 04:38:13 | Tags: javascript, r, web-scraping

Hello fellow R fanatics...

I've been using R to scrape data from a variety of websites for a while now, but this one has me stumped.

I am trying to scrape the data from the following table: http://www.vigimeteo.com/PREV/obs/obs_seul.html?a=07005&b=

However, my efforts thus far have failed.

I have tried the following:

1. A simple wget. This returns the site's HTML along with some of the JavaScript functions used to populate the table, but I haven't been able to work through it and find the parts I could use to grab the data with some of R's JS utilities. It may be that my experience with JS is too limited.

2. I tried the solution in "Reading data from iframe", because it looked like the original website had the table in an iframe, but again no luck.

3. A combination of getURL and readHTMLTable:

       library(RCurl)  # provides getURL
       library(XML)    # provides readHTMLTable

       thisURL <- "http://www.vigimeteo.com/PREV/obs/obs_seul.html?a=07005&b="
       theURL <- getURL(thisURL, .opts = list(ssl.verifypeer = FALSE))
       tables <- readHTMLTable(theURL)

   This results in an empty table.

4. I spent about an hour going through every part of the HTML and JavaScript code I could find, but with limited success, as detailed in 1.

5. It appears that R's Selenium package might offer a solution, but I haven't yet figured out how to use it here, probably due to unfamiliarity.
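For the Selenium route mentioned above, a minimal sketch might look like the following. It assumes the RSelenium and XML packages are installed and that a local Firefox and matching driver are available; the helper names `fetch_rendered_html` and `tables_from_html` and the fixed five-second wait are illustrative choices, not part of the original post. If the table really lives in an iframe, the page source of the outer page may still lack it, in which case pointing the browser at the iframe's own URL is one option to try.

```r
# Sketch only: drives a real browser so the page's JavaScript runs before
# the HTML is read. Function names and defaults here are hypothetical.

# Open a browser, load the page, wait, and return the rendered HTML.
fetch_rendered_html <- function(url, browser = "firefox", wait_seconds = 5) {
  driver <- RSelenium::rsDriver(browser = browser, verbose = FALSE)
  # Always shut down the browser session and the Selenium server.
  on.exit({
    driver$client$close()
    driver$server$stop()
  }, add = TRUE)

  driver$client$navigate(url)
  Sys.sleep(wait_seconds)  # crude wait for the scripts to fill the table
  driver$client$getPageSource()[[1]]
}

# Parse every <table> in an HTML string into a list of data frames.
tables_from_html <- function(html) {
  doc <- XML::htmlParse(html, asText = TRUE)
  XML::readHTMLTable(doc, stringsAsFactors = FALSE)
}

# Usage (needs a browser, a driver, and network access):
# page_url <- "http://www.vigimeteo.com/PREV/obs/obs_seul.html?a=07005&b="
# tables <- tables_from_html(fetch_rendered_html(page_url))
```

The parsing step is kept separate from the browser step so it can be checked on a saved copy of the page without launching Selenium.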

I feel like I'm just missing an essential part here... perhaps due to my lack of knowledge of JS and XML?

UPDATE:

I've noticed that if I right-click on the table element and use Chrome's "Inspect", it shows HTML that contains all of the table's values and would be very scrape-able... I'm still not sure how to get to this point in R, though. Does anyone have hints on where to look in the "Inspect" screen to help guide my progress?

1 solution

The solution was as follows:

1. Using the page source, identify the source HTML page for the table.
2. Navigate to that source page and open Chrome Developer Tools > Network > XHR.
3. Refresh the page to find the request that supplies the data.
4. Scrape from that source.
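The last step could be sketched in R roughly as follows. The answer does not give the actual request URL or say what format the response is in, so `xhr_url` below is a hypothetical placeholder and the JSON/HTML switch is an assumption to verify against what the Network > XHR tab shows. The helper names are illustrative, and the httr, jsonlite, and XML packages are assumed to be installed.

```r
# Sketch only: fetch the data request found under Network > XHR directly,
# instead of the page that displays it.

# Hypothetical placeholder: replace with the request URL copied from DevTools.
xhr_url <- "http://example.com/path/to/data-endpoint"

# Download the raw response body as text, stopping on HTTP errors.
fetch_xhr_text <- function(url) {
  resp <- httr::GET(url)
  httr::stop_for_status(resp)
  httr::content(resp, as = "text", encoding = "UTF-8")
}

# Convert the body into R data. Which branch applies depends on the
# response format, which is an assumption to confirm in DevTools.
parse_xhr_text <- function(txt, format = c("json", "html")) {
  format <- match.arg(format)
  if (format == "json") {
    jsonlite::fromJSON(txt)
  } else {
    XML::readHTMLTable(XML::htmlParse(txt, asText = TRUE),
                       stringsAsFactors = FALSE)
  }
}

# Usage (needs network access and the real endpoint):
# obs <- parse_xhr_text(fetch_xhr_text(xhr_url), format = "json")
```

Requesting the data source directly avoids running any JavaScript, which is why the plain getURL approach can work once it is pointed at the right URL.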

Thanks to @XR SC, whose answer to "web scraping using Chrome Dev Tools" provided the basic approach.