
Fastest way to scrape a text node with CasperJS

I have the following structure, and I need to get the text from the plain text nodes marked below:

<strong><font color="#666666">Phones:</font></strong>
<br>
<br>
<img src="/image/fgh.jpg" title="Velcom" alt="Velcom" style="margin: 2 5 -3 5;">
"+375 29"              //get this
<b>611 77 83</b>      //and this

I tried using an XPath expression copied from the Chrome console:

casper.thenOpen('url', function() {
    result = this.getElementInfo(x('//*[@id="main_content"]/table[2]/tbody/tr[17]/td/table/tbody/tr/td[1]/p[1]/text()[3]'));
});

casper.then(function() {
    this.echo(result.text);
});

but it does not work. The same happens when I try result.data, and this call:

console.log(this.getElementInfo(x('//*[@id="main_content"]/table[2]/tbody/tr[17]/td/table/tbody/tr/td[1]/p[1]/text()[3]')));

returns null, even though I checked that this element exists in the page.

Make sure you have included:

var x = require('casper').selectXPath;

If that still does not work, the following retrieves all the text from the page, which you can then parse. This is not recommended for performance reasons, but it works if you have anchor text to parse on. You will need to modify it slightly.

var casper = require("casper").create ({
    waitTimeout: 15000,
    stepTimeout: 15000,
    verbose: true,
    viewportSize: {
        width: 1400,
        height: 768
    },
    onWaitTimeout: function() {
        this.echo('Wait timeout occurred');
        this.capture('xWait_timeout.png');
        this.exit();
    },
    onStepTimeout: function() {
        this.echo('Step timeout occurred');
        this.capture('xStepTimeout.png');
        this.exit();
    }
});

// Relay console.log output from the page context
casper.on('remote.message', function(msg) {
    this.echo('***remote message caught***: ' + msg);
});

casper.userAgent('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4');

// vars
var gUrl           = 'WebAddy'; //+++ Update URL

casper.start(gUrl, function() {
    var tPlainText = this.evaluate(function() {
        var body        = document.body;
        var textContent = body.textContent || body.innerText;

        if (textContent.indexOf('Phones:') === -1) {
            return 'Phone Text Not Found';
        }

        // Keep only the text that follows the "Phones:" anchor
        var tStr  = textContent.split('Phones:')[1] || '';
        // +++ insert the text that marks where parsing should stop;
        // with the empty placeholder, indexOf returns 0 and the result is ''
        var tPos1 = tStr.indexOf('');

        return (tPos1 !== -1) ? tStr.substring(0, tPos1) : null;
    });
    console.log(tPlainText);
});

casper.run();
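
The parsing step in the script above can be isolated into a small helper, which makes it easier to check on its own. This is only a sketch: the name textBetween and the 'Address:' stop marker used in the usage note are hypothetical and do not come from the original page.

```javascript
// Hypothetical helper: returns the text between startMarker and stopMarker,
// with whitespace collapsed, or null when startMarker is not present.
function textBetween(text, startMarker, stopMarker) {
    var start = text.indexOf(startMarker);
    if (start === -1) {
        return null;
    }
    var rest = text.substring(start + startMarker.length);
    // An empty or missing stop marker means "take everything that follows".
    var stop = stopMarker ? rest.indexOf(stopMarker) : -1;
    var slice = (stop === -1) ? rest : rest.substring(0, stop);
    // Collapse the runs of whitespace left over from the markup.
    return slice.replace(/\s+/g, ' ').trim();
}
```

For example, textBetween(textContent, 'Phones:', 'Address:') would return the phone text if the page happened to contain an 'Address:' label after the numbers. To use it in the script, the function would have to be defined inside the callback passed to evaluate, since that callback runs in the page context.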

An old question, but I had the same issue. I needed to get the text that follows an element, so here is how I did it.

__utils__.getElementByXPath("//bla...bla/following-sibling::node()").textContent;
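
The same idea can be written as a plain DOM walk, which may help when more than one following node is needed (in the question, a text node and then a <b> element). This is a minimal sketch, not the answerer's code; the helper name followingText is hypothetical, and it relies only on the standard nodeType, textContent and nextSibling properties.

```javascript
// Hypothetical helper: joins the text of up to maxNodes non-empty siblings
// that follow `node`. It accepts any object exposing nodeType, textContent
// and nextSibling, which is what DOM nodes provide.
function followingText(node, maxNodes) {
    var parts = [];
    var current = node.nextSibling;
    while (current && parts.length < maxNodes) {
        // 3 = text node, 1 = element node; comments and other types are skipped.
        if (current.nodeType === 3 || current.nodeType === 1) {
            var text = (current.textContent || '').replace(/\s+/g, ' ').trim();
            if (text) {
                parts.push(text);
            }
        }
        current = current.nextSibling;
    }
    return parts.join(' ');
}
```

Inside the callback passed to evaluate, the starting node could be located with __utils__.getElementByXPath, for instance with an expression such as '//img[@alt="Velcom"]' (an assumption based on the markup in the question), and then handed to followingText(node, 2).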
