
Scraping Javascript-rendered webpage that references external javascript scripts in R

I am trying to scrape this webpage: https://www.mustardbet.com/sports/events/302698

Since the webpage seems to be rendered dynamically, I am following this tutorial: https://www.datacamp.com/community/tutorials/scraping-javascript-generated-data-with-r#gs.dZEqev8

As the tutorial suggests, I save a file named "scrape_mustard.js" with the following code:

// scrape_mustard.js

var webPage = require('webpage');
var page = webPage.create();

var fs = require('fs');
var path = 'mustard.html';

page.open('https://www.mustardbet.com/sports/events/302698', function (status) {
  var content = page.content;
  fs.write(path, content, 'w');
  phantom.exit();
});

Then, I run:

system("./phantomjs scrape_mustard.js")

but I get the error:

ReferenceError: Can't find variable: Set

  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1

Now, when I paste "https://www.mustardbet.com/assets/js/index.dfd873fb.js" into my browser I can see that it's javascript, and that I probably need to either (1) save that as a file, or (2) include it in scrape_mustard.js.

But if (1), I don't know how to then reference that new file, and if (2), I don't know how to define all that javascript properly so that it can be used.

I'm a complete newbie to javascript, but maybe this problem is not too difficult?

Thanks for your help!

I was able to scrape using the js module puppeteer.js.

Download and install node.js. node.js comes with npm, which makes your life easier when it comes to installing modules. You need to install puppeteer using npm.

In RStudio, make sure you are in your working directory when you install puppeteer. Once node.js is installed, run (source):

system("npm i puppeteer")

scrape_mustard.js:

// load modules
const fs = require("fs");
const puppeteer = require("puppeteer");

// page url
const url = "https://www.mustardbet.com/sports/events/302698";

const scrape = async () => {
    const browser = await puppeteer.launch({headless: false}); // open browser
    const page = await browser.newPage(); // open new page
    await page.goto(url, {waitUntil: "networkidle2", timeout: 0}); // go to page
    await page.waitFor(5000); // give it time to load all the javascript rendered content
    const html = await page.content(); // copy page contents
    await browser.close(); // close chromium
    return html; // return html object
};

scrape().then((value) => {
    fs.writeFileSync("./stackoverflow/page.html", value) // write the object being returned by scrape()
});
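
If the fixed 5-second wait ever turns out to be too short (or needlessly long), a possible variant is to wait for an element that only exists once the odds have rendered. This is just a sketch, not something tested against this page; it assumes the .odds-major spans queried in the R code below are the content you need:

// scrape_mustard_selector.js -- variant: wait for a known element instead of a fixed delay
const fs = require("fs");
const puppeteer = require("puppeteer");

const url = "https://www.mustardbet.com/sports/events/302698";

(async () => {
    const browser = await puppeteer.launch({headless: false}); // open browser
    const page = await browser.newPage();
    await page.goto(url, {waitUntil: "networkidle2", timeout: 0});
    await page.waitForSelector(".odds-major", {timeout: 30000}); // resolves once at least one odds span is in the DOM
    const html = await page.content();
    await browser.close();
    fs.writeFileSync("./stackoverflow/page.html", html); // same output file as above
})();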

To run scrape_mustard.js in R:

library(magrittr)

system("node ./stackoverflow/scrape_mustard.js")

html <- xml2::read_html("./stackoverflow/page.html")

oddsMajor <- html %>% 
  rvest::html_nodes(".odds-major")

betNames <- html %>% 
  rvest::html_nodes("h3")

Console output:

> oddsMajor
{xml_nodeset (60)}
 [1] <span class="odds-major">2</span>
 [2] <span class="odds-major">14</span>
 [3] <span class="odds-major">15</span>
 [4] <span class="odds-major">16</span>
 [5] <span class="odds-major">17</span>
 [6] <span class="odds-major">23</span>
 [7] <span class="odds-major">25</span>
 [8] <span class="odds-major">32</span>
 [9] <span class="odds-major">33</span>
[10] <span class="odds-major">39</span>
[11] <span class="odds-major">47</span>
[12] <span class="odds-major">54</span>
[13] <span class="odds-major">55</span>
[14] <span class="odds-major">58</span>
[15] <span class="odds-major">58</span>
[16] <span class="odds-major">64</span>
[17] <span class="odds-major">73</span>
[18] <span class="odds-major">73</span>
[19] <span class="odds-major">92</span>
[20] <span class="odds-major">98</span>
...
> betNames
{xml_nodeset (60)}
 [1] <h3>Charles Howell III</h3>\n
 [2] <h3>Brian Harman</h3>\n
 [3] <h3>Austin Cook</h3>\n
 [4] <h3>J.J. Spaun</h3>\n
 [5] <h3>Webb Simpson</h3>\n
 [6] <h3>Cameron Champ</h3>\n
 [7] <h3>Peter Uihlein</h3>\n
 [8] <h3>Seung-Jae Im</h3>\n
 [9] <h3>Nick Watney</h3>\n
[10] <h3>Graeme McDowell</h3>\n
[11] <h3>Zach Johnson</h3>\n
[12] <h3>Lucas Glover</h3>\n
[13] <h3>Corey Conners</h3>\n
[14] <h3>Luke List</h3>\n
[15] <h3>David Hearn</h3>\n
[16] <h3>Adam Schenk</h3>\n
[17] <h3>Kevin Kisner</h3>\n
[18] <h3>Brian Gay</h3>\n
[19] <h3>Patton Kizzire</h3>\n
[20] <h3>Brice Garnett</h3>\n
...

I am sure it can be done with phantomjs, but I've found puppeteer easier for scraping javascript-rendered webpages. Also keep in mind that phantomjs is no longer being developed.
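
For what it's worth, the "ReferenceError: Can't find variable: Set" in your phantomjs attempt suggests that phantomjs' javascript engine does not provide the ES6 Set constructor that the site's bundled script (index.dfd873fb.js) uses. One possible (untested) workaround is to inject an ES6 polyfill into the page before its own scripts run. A minimal sketch, assuming you have saved a polyfill bundle (e.g. core-js) locally as polyfill.js, which is a hypothetical filename:

// scrape_mustard.js -- sketch: polyfill ES6 globals before the page's own scripts execute
var webPage = require('webpage');
var page = webPage.create();

var fs = require('fs');
var path = 'mustard.html';

page.onInitialized = function () {
    // polyfill.js is a hypothetical local copy of an ES6 polyfill bundle (e.g. core-js);
    // injecting it here makes Set available before index.dfd873fb.js runs
    page.injectJs('polyfill.js');
};

page.open('https://www.mustardbet.com/sports/events/302698', function (status) {
    fs.write(path, page.content, 'w');
    phantom.exit();
});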
