
How to get hrefs from elements with a certain class through CasperJS?

I've got a web page with information in this format:

<p>
    <a class="class1" href="href1">text1</a>
    text2
</p>

<p>
    <a class="class1" href="href2">text1a</a>
    text2a
</p>

Using CasperJS, I need to get an array of all the information associated with the elements of class class1, in this format:

href1
text1
text2

href2
text1a
text2a

I've tried using this code:

var casper = require('casper').create();
casper.start('url', function() {
    require('utils').dump(this.getElementsAttribute('div[class="class1"]', 
          'class'));
});
casper.run();

However, I just got '[]' as the result.

Can anybody help me find the error in my code?

div[class="class1"] cannot work as a selector, because your markup has no <div> elements with the class1 class; the class is on the <a> elements. You can try the following, but it won't get you far, because it only returns the href values and not the accompanying text:

this.getElementsAttribute('a.class1', 'href');

Building an array of objects in the page context

Doing this with CasperJS functions alone is hard and error-prone. It is much easier to iterate over all the links in the page context and pick out the parts that you need.

casper.then(function(){
    var info = this.evaluate(function(){
        var links = document.querySelectorAll(".class1");
        // iterate over links and collect stuff
        return Array.prototype.map.call(links, function(link){
            return {
                href: link.href,
                hrefText: link.textContent.trim(),
                afterText: link.parentNode.childNodes[2].textContent.trim()
            };
        });
    });
    require('utils').dump(info);
});

How this works:

You can get all the links by querying for all elements with the class class1. Since the result of querySelectorAll() is not an array but an array-like NodeList, you can't call .map() on it directly; borrowing it through Array.prototype.map.call() works.
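The borrowing trick can be tried outside a browser. The following is a minimal sketch in which a plain object (the hypothetical fakeNodeList) stands in for the NodeList; it only assumes that the object has indexed entries and a length property:

```javascript
// Stand-in for a NodeList: indexed entries plus a length, but no array methods.
var fakeNodeList = {
    0: { href: 'href1' },
    1: { href: 'href2' },
    length: 2
};

// The object has no map() of its own, so fakeNodeList.map(...) would throw.
var hasOwnMap = typeof fakeNodeList.map === 'function';

// Borrowing map() from Array.prototype works on any array-like object
// and returns a real array.
var hrefs = Array.prototype.map.call(fakeNodeList, function (link) {
    return link.href;
});

console.log(hasOwnMap); // false
console.log(hrefs);     // the two href values, as a real array
```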

Each link has an href property and a textContent property. Note that link.href is the resolved absolute URL; link.getAttribute('href') gives the literal attribute value if that is what you need. The text after the link is a little trickier. You first need to get the parent of the link (<p>) and then get the text node after the link through the parent's childNodes property.

childNodes[2] is probably needed rather than childNodes[1], because the first child (childNodes[0]) is probably a text node containing only whitespace, which makes the link childNodes[1] and shifts everything after it by one.
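If the fixed index feels fragile, one alternative is to walk the siblings after the link until a non-empty text node turns up. This is only a sketch: textAfter is a hypothetical helper, not part of CasperJS, and the nodes below are plain-object stand-ins for what a browser would probably produce for the first <p> of the question's markup.

```javascript
// Hypothetical helper: returns the trimmed text of the first non-empty
// text node that follows `node`, or '' if there is none.
function textAfter(node) {
    var sibling = node.nextSibling;
    while (sibling) {
        // nodeType 3 marks a text node
        if (sibling.nodeType === 3 && sibling.textContent.trim() !== '') {
            return sibling.textContent.trim();
        }
        sibling = sibling.nextSibling;
    }
    return '';
}

// Plain-object stand-ins for the three child nodes of the first <p>.
var leadingWhitespace = { nodeType: 3, textContent: '\n    ' };
var link = { nodeType: 1, textContent: 'text1' };
var trailingText = { nodeType: 3, textContent: '\n    text2\n' };
leadingWhitespace.nextSibling = link;
link.nextSibling = trailingText;
trailingText.nextSibling = null;

var childNodes = [leadingWhitespace, link, trailingText];

console.log(childNodes[2].textContent.trim()); // text2
console.log(textAfter(link));                  // text2
```

Inside evaluate(), such a helper would have to be defined within the evaluated function itself, because functions from the outer script are not available in the page context.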

Building a single string in the page context

You can also build a textual representation while iterating over the links:

casper.then(function(){
    var info = this.evaluate(function(){
        var links = document.querySelectorAll(".class1");
        // iterate over links and collect stuff
        return Array.prototype.map.call(links, function(link){
            return [
                link.href,
                link.textContent.trim(),
                link.parentNode.childNodes[2].textContent.trim()
            ].join('\n');
        }).join('\n\n');
    });
    this.echo(info);
});

How it works:

JavaScript arrays have a join() function. It joins all elements into a single string using the specified separator.
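The two levels of joining can be seen in isolation with hard-coded data. The records array below is made up from the sample values in the question; in the real script the values come from the DOM:

```javascript
// Sample data standing in for the values collected from the page.
var records = [
    ['href1', 'text1', 'text2'],
    ['href2', 'text1a', 'text2a']
];

// Inner join: one field per line. Outer join: a blank line between records.
var output = records.map(function (fields) {
    return fields.join('\n');
}).join('\n\n');

console.log(output);
```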


Keep in mind that the page context (evaluate()) is sandboxed. The documentation says:

Note: The arguments and the return value to the evaluate function must be a simple primitive object. The rule of thumb: if it can be serialized via JSON, then it is fine.

Closures, functions, DOM nodes, etc. will not work!
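The rule of thumb can be approximated with a JSON round trip. This is only an illustration of what "serializable via JSON" means; the actual transfer between the page context and the CasperJS script may differ in detail. The candidate object and its onClick property are made up for the example:

```javascript
// A value one might be tempted to return from evaluate().
var candidate = {
    href: 'href1',
    hrefText: 'text1',
    onClick: function () { return 'clicked'; }
};

// Approximation of crossing the sandbox boundary: serialize, then parse.
var roundTripped = JSON.parse(JSON.stringify(candidate));

console.log(roundTripped.href);         // href1
console.log('onClick' in roundTripped); // false, the function was dropped
```

Plain strings, numbers, arrays, and objects made of them survive this round trip, which is why both snippets above return only such values from evaluate().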
