How do I scrape data in a <canvas> element with Python or JavaScript?

I want to scrape data from sites like this (a stats page for a game I play), where an interactive chart is rendered in a <canvas> element and none of the data appears as a scrapeable HTML element. Inspecting the HTML, the page appears to use Chart.js.

Though help in Python is preferred, if I really need to use some JavaScript, that would be fine too.

Also, I would like to avoid methods that require extra files, such as PhantomJS, but again, if that's the only way, please share it.

One way to solve this is to look at the <script> in the page source, around line 1050, which is where the charts are actually initialized. The initialization follows a recurring pattern: the canvas elements are queried one by one to get their contexts, and each query is followed by the variables that hold the labels and statistics of that chart.
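To make that pattern concrete, here is a minimal sketch that uses only Node.js built-ins. The sample script is hypothetical: the canvas ID, variable names, and numbers are invented and only modeled on the description above, so the real page may differ. The regular expressions assume simple, flat array literals of numbers and quoted strings. They are a rough stand-in for illustration, not the AST approach used in the solution below.

```javascript
// Hypothetical excerpt modeled on the pattern described above; the canvas ID,
// variable names and numbers are invented for illustration.
const sampleScript = `
var ctx = document.getElementById('killsChart').getContext('2d');
var values = [12, 7, 3];
var labels = ['Rifle', 'Sniper', 'Pistol'];
`;

// Pull out the canvas ID and every flat array literal assigned in the script.
// Assumes the arrays contain only numbers and simple quoted strings.
function extractChart(script) {
    const idMatch = script.match(/getElementById\(\s*['"]([^'"]+)['"]\s*\)/);
    if (!idMatch) return null;
    const arrays = [...script.matchAll(/=\s*(\[[^\]]*\])\s*;/g)]
        .map(m => JSON.parse(m[1].replace(/'/g, '"')));
    return { id: idMatch[1], arrays };
}

const chart = extractChart(sampleScript);
console.log(chart);
```

A regex like this breaks as soon as the arrays contain nested brackets or strings with quotes in them, which is one reason to parse the script into an AST instead.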

This solution uses a recent version of Node.js with the following modules:

  • cheerio for querying elements in the DOM.
  • axios for sending an HTTP request to get the page source.
  • abstract-syntax-tree for getting a JavaScript object tree representation of the script that we wish to scrape.

Here's the solution and its source code:

const cheerio = require('cheerio');
const axios = require('axios');
const { parse, each, find } = require('abstract-syntax-tree');

async function main() {

    // get the page source
    const { data } = await axios.get(
        'https://stats.warbrokers.io/players/i/5d2ead35d142affb05757778'
    );

    // load the page source with cheerio to query the elements
    const $ = cheerio.load(data);

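    // get the script tag that contains the string 'Chart.defaults'
    const contents = $('script')
        .toArray()
        .map(script => $(script).html() || '')
        .find(text => text.includes('Chart.defaults'));

    // stop early if no chart script was found (e.g. the page layout changed)
    if (!contents) {
        throw new Error('No <script> containing Chart.defaults was found');
    }

    // convert the script content to an AST
    const ast = parse(contents);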

    // we'll put all declarations in this object
    const declarations = {};

    // current key
    let key = null;

    // iterate over all variable declarations inside a script
    each(ast, 'VariableDeclaration', node => {

        // iterate over possible declarations, e.g. comma separated
        node.declarations.forEach(item => {

            // let's get the key to contain the values of the statistics and their labels
            // we'll use the ID of the canvas itself in this case..
            if(item.id.name === 'ctx') { // is this a canvas context variable?
                // get the only string literal that is not '2d'
                const literal = find(item, 'Literal').find(v => v.value !== '2d');
                if(literal) { // do we have non- '2d' string literals?
                    // then assign it as the current key
                    key = literal.value;
                }
            }

            // ensure that the variable we're getting is an array expression
            if(key && item.init && item.init.type === 'ArrayExpression') {

                // get the array expression
                const array = item.init.elements.map(v => v.value);

                // did we get the values from the statistics?
                if(declarations[key]) {

                    // zip the objects to associate keys and values properly
                    const result = {};
                    for(let index = 0; index < array.length; index++) {
                        result[array[index]] = declarations[key][index];
                    }
                    declarations[key] = result;

                    // let's make the key null again to avoid getting
                    // unnecessary array expression
                    key = null;

                } else {
                    // store the values
                    declarations[key] = array;
                }
            }

        });

    });

    // logging it here, it's up to you how you deal with the data itself
    console.log(declarations);

}

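// report network or parsing failures instead of leaving the rejection unhandled
main().catch(console.error);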
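
The zipping step inside the loop above uses the second array found for a canvas as keys and the first as values. As a minimal sketch, that step can be isolated into a small helper. The name zipToObject and the sample data are hypothetical, and the length check is an added assumption: the loop above does not perform it and would silently produce undefined values on a mismatch.

```javascript
// Pair each label with the value at the same index.
// zipToObject is a hypothetical helper name, not part of any library.
function zipToObject(labels, values) {
    // Added assumption: treat a length mismatch as an error rather than
    // silently producing undefined values.
    if (labels.length !== values.length) {
        throw new RangeError(`expected ${labels.length} values, got ${values.length}`);
    }
    return Object.fromEntries(labels.map((label, i) => [label, values[i]]));
}

// Invented sample data for illustration.
console.log(zipToObject(['Rifle', 'Sniper'], [12, 7]));
```

Failing loudly here makes it easier to notice when the page's script changes shape and the two arrays no longer line up.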
