
Scraping a React Site with PhantomJS

I am scraping a website that uses React components, using PhantomJS from Node.js.

I am using this wrapper: https://github.com/amir20/phantomjs-node

Here is the code:

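var phantom = require('phantom');
var _ph, _page;
// `url` is the address of the page to scrape (defined elsewhere).

phantom.create().then(ph => {
    _ph = ph;
    return _ph.createPage();
}).then(page => {
    _page = page;
    return _page.open(url);
}).then(status => {
    // Reads the page HTML as soon as the load event has fired.
    return _page.property('content');
}).then(content => {
    console.log(content);
    _page.close();
    _ph.exit();
}).catch(e => console.log(e));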

The problem is that the React content is not rendered. The output contains only <!-- react-empty: 1 --> where the actual React component should be loaded.

How can I scrape the rendered React component? I initially switched from a pure node-request solution to PhantomJS to fix this, but now I am stuck.


UPDATE:

I don't have a real solution for PhantomJS yet. I switched to NightmareJS ( https://github.com/segmentio/nightmare ), which has a nice .wait('.some-selector') function that waits until the specified selector is present on the page. This fixed my problems with dynamically loaded React components.
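For reference, the NightmareJS approach described in the update might look roughly like the sketch below. The helper name scrapeRendered and its parameters are made up for illustration, and it assumes `browser` is a Nightmare instance; goto, wait, evaluate and end are the Nightmare calls being chained.

```javascript
// Hypothetical helper: `browser` is assumed to be a Nightmare instance,
// e.g. require('nightmare')({ show: false }).
function scrapeRendered(browser, url, selector) {
    return browser
        .goto(url)
        .wait(selector) // waits until the selector is present in the DOM
        .evaluate(function (sel) {
            // Runs inside the page and returns the rendered markup.
            return document.querySelector(sel).outerHTML;
        }, selector)
        .end(); // shuts the browser down once the queued actions have run
}

// Usage (assumes the nightmare package is installed):
// var Nightmare = require('nightmare');
// scrapeRendered(Nightmare(), url, '.some-selector')
//     .then(function (html) { console.log(html); })
//     .catch(function (e) { console.log(e); });
```

The selector should be one that only appears after React has rendered the component, otherwise the wait returns too early.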

I think you need to wait for the React elements to render after the page has loaded. Below is an example of such a waiting function, using Q promises. It returns a promise and checks the page state every 50 ms. If the required page state is reached, the function resolves the promise. If the timeout expires first, it rejects the promise.

var phantom = require('phantom');
var Q = require('q');
var _ph, _page, _outObj;
var url = 'https://tech.yandex.ru/maps/jsbox/';

phantom.create().then(ph => {
    _ph = ph;
    return _ph.createPage();
}).then(page => {
    _page = page;
    return _page.open(url);
}).then(status => {
    console.log(status);
    return waitState(textPopulated, 3);
}).then(() => {
    return _page.property('content');
}).then(content => {
    console.log(content);
    _page.close();
    _ph.exit();
}).catch(e => console.log(e));

function textPopulated() {
    return _page.evaluate(function() {
        var layer = document.querySelector('.ace_text-layer');
        return layer && layer.childElementCount;
    }).then(function(childElementCount) {
        console.log('childElementCount: ' + childElementCount);
        return childElementCount > 0;
    });
}

function waitState(state, timeout) {  // timeout in seconds is optional
    console.log('Start waiting for state: ' + state.name);

    var limitTime = timeout * 1000 || 20000;
    var startTime = new Date();

    return wait();

    function wait() {
        return state().then(function(result) {
            if (result) {
                console.log('Reached state: ' + state.name);
                return;
            } else if (new Date() - startTime > limitTime) {
                var errorMessage = 'Timeout state: ' + state.name;
                console.log(errorMessage);
                throw new Error(errorMessage);
            } else {
                return Q.delay(50).then(wait);
            }
        }).catch(function(error) {
            throw error;
        });
    }
}
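
If adding Q only for Q.delay is undesirable, the same polling loop can be sketched with native promises. This is a minimal variant, not part of phantomjs-node: the names waitFor, timeoutMs and intervalMs are made up for illustration, and the defaults (20 s timeout, 50 ms interval) simply mirror the function above.

```javascript
// Polls `predicate` (which may return a value or a promise) until it yields
// a truthy result, or rejects once `timeoutMs` has elapsed.
// `waitFor`, `timeoutMs` and `intervalMs` are illustrative names.
function waitFor(predicate, options) {
    var opts = options || {};
    var timeoutMs = opts.timeoutMs || 20000;
    var intervalMs = opts.intervalMs || 50;
    var deadline = Date.now() + timeoutMs;

    function delay(ms) {
        return new Promise(function (resolve) { setTimeout(resolve, ms); });
    }

    function poll() {
        // Promise.resolve().then(...) also turns a synchronous throw into a rejection.
        return Promise.resolve().then(predicate).then(function (result) {
            if (result) {
                return result;
            }
            if (Date.now() > deadline) {
                throw new Error('Timed out waiting for: ' + (predicate.name || 'condition'));
            }
            return delay(intervalMs).then(poll);
        });
    }

    return poll();
}

// Usage with the predicate defined above:
// waitFor(textPopulated, { timeoutMs: 3000 }).then(() => _page.property('content'));
```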
