
Moving between pages and scraping as I go with Nightmare

There is a website with a page listing 25 entries, where each entry is a link to a page containing some information that I need. I want to get to the listing page and then:

1) click on the link to the first entry
2) retrieve all the HTML
3) click back to the listing page (there is a button for this)
4) repeat for every other entry

I would also like to do this as efficiently as possible, which I've been told means leveraging promises. Here's my code sketch, which doesn't work:

var Nightmare = require('nightmare');
var nightmare = Nightmare({ openDevTools: true, show: true })
var Xray = require('x-ray');
var x = Xray();
var resultArr = [];

nightmare
.goto(hidTestURL)
.wait(2500)
.click('input[name="propertySearchOptions:advanced"]') //start navigating to listing page
.wait(2500)
.type('input[name="propertySearchOptions:streetName"]', 'Main')
.wait(2500)
.select('select[name="propertySearchOptions:recordsPerPage"]', '25')
.wait(2500)
.click('input[name="propertySearchOptions:search"]') //at listing page
.wait(2500)
.then(function(){
  nightmare
  .click('a[href^="Property.aspx?prop_id=228645"]') //first entry
  .evaluate(function(){ //retrieve info
    var resultArr = [];
    resultArr.push(document.querySelector('html').innerHTML);
  })
})

nightmare
.click('a[id="propertyHeading_searchResults"]') //return to listing page
.evaluate(function(){
  return resultArr.push(document.querySelector('html').innerHTML); //retrieve listing page info to show that it returned.
})
.then(function (resultArr) {
  console.log('resultArr', resultArr);
  x(resultArr[1], 'body@html') //output listing page html
    .write('results.json');
})

This gets as far as the listing page, and then does not proceed any further. I also tried the same code, but with return nightmare for every use of nightmare except the first one. I'd seen some examples that used return, but when I did this, the code threw an error.

I also tried leaving out the third nightmare (the one after the blank line) and instead continuing the old nightmare instance by going straight to the .click(), but this also threw an error.

I clearly need some help with the syntax and semantics of Nightmare, but there is not much documentation online besides an API listing. Does anyone know how I can make this work?

First, calling Nightmare like you have it - broken into two chains - is probably not going to do what you want. (This comment thread is a good - albeit long - primer.) If memory serves, actions from the second chain will be queued immediately after the first, resulting in (probably) undesirable behavior. You said you had it written slightly differently - I'd be curious to see it; it sounds like it may have been a little closer.

Second, you're trying to lift resultArr in .evaluate(), which isn't possible. The function passed to .evaluate() is stringified and reconstituted inside of Electron, meaning that you'll lose the ambient context around the function. This example in nightmare-examples goes into a little more depth, if you're curious.
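To make the second point concrete, here is a rough sketch in plain Node with no Nightmare involved. The helper `rebuildFromSource` is hypothetical and only imitates the "stringify, then rebuild somewhere else" step; Nightmare's real transport into Electron is more involved. It shows why a variable captured from the surrounding scope is gone after the round trip, and why data has to go in as arguments and come out as a return value (Nightmare's `.evaluate()` accepts extra arguments after the function for this purpose).

```javascript
// Hypothetical stand-in for what happens to a function passed to .evaluate():
// it is turned into source text and rebuilt elsewhere, so closures do not survive.
function rebuildFromSource(fn) {
  // new Function evaluates the source in the global scope, not the caller's scope
  return new Function('return (' + fn.toString() + ')')();
}

function makeCollector() {
  var resultArr = []; // local variable, only reachable through a closure
  return function (value) {
    resultArr.push(value);
    return resultArr.length;
  };
}

var collector = makeCollector();
var lengthViaClosure = collector('a'); // works: the closure is intact

var rebuilt = rebuildFromSource(collector);
var rebuiltError = null;
try {
  rebuilt('b'); // resultArr no longer exists for the rebuilt function
} catch (err) {
  rebuiltError = err;
}

// Data has to cross the boundary explicitly: in as arguments, out as a return value
var pure = rebuildFromSource(function (items, value) {
  return items.concat([value]);
});
var passedExplicitly = pure(['a'], 'b');
```

In the question's script, the practical consequence is that the function given to `.evaluate()` should simply return the HTML, and the pushing into an array should happen in the following `.then()`.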

Third, and maybe this is a typo or me misunderstanding your intent: your href selector uses the starts-with ( ^= ) operator. Is that intentional? Should that be an ends-with ( $= )?

Fourth, looping over asynchronous operations is tricky. I get the impression that may also be a stumbling block?
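As a minimal illustration of the fourth point, the sketch below chains one promise per item with `Array.prototype.reduce` so that each task starts only after the previous one has finished. `fakeVisit` and `runInSequence` are made-up names for this example, and the timer merely stands in for a page visit; nothing here touches Nightmare.

```javascript
// Hypothetical async task standing in for "visit a page and return its HTML"
function fakeVisit(href) {
  return new Promise(function (resolve) {
    setTimeout(function () {
      resolve('<html>' + href + '</html>');
    }, 10);
  });
}

// Run one task per item, strictly one after another, collecting results in order
function runInSequence(items, task) {
  return items.reduce(function (accumulator, item) {
    // wait for everything so far, then start the next task
    return accumulator.then(function (results) {
      return task(item).then(function (result) {
        results.push(result);
        return results;
      });
    });
  }, Promise.resolve([])); // start with a promise for an empty results array
}

var visited = []; // records the order in which tasks were started
var pages = runInSequence(['a', 'b', 'c'], function (href) {
  visited.push(href);
  return fakeVisit(href);
});

pages.then(function (results) {
  console.log(results.length + ' pages collected');
});
```

A plain `forEach` over the same items would start every task at once, which is a problem when all of them share a single browser window.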

With all of that in mind, let's take a look at modifying your original script. Admittedly untested, as I don't have access to your testing URL, so this is a bit from the hip:

var Nightmare = require('nightmare');
var nightmare = Nightmare({ openDevTools: true, show: true })
var Xray = require('x-ray');
var x = Xray();

nightmare
.goto(hidTestURL)
.wait(2500)
.click('input[name="propertySearchOptions:advanced"]') //start navigating to listing page
.wait(2500)
.type('input[name="propertySearchOptions:streetName"]', 'Main')
.wait(2500)
.select('select[name="propertySearchOptions:recordsPerPage"]', '25')
.wait(2500)
.click('input[name="propertySearchOptions:search"]') //at listing page
.wait(2500)
.evaluate(function(){
  //using `Array.from` as the DOMList is not an array, but an array-like, sort of like `arguments`
  //planning on using `Array.map()` in a moment
  return Array.from(
    //give me all of the elements where the href contains 'Property.aspx'
    document.querySelectorAll('a[href*="Property.aspx"]'))
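    //pull the raw href attributes for those anchors; `a.href` would give the resolved absolute URL,
    //which may not match the `a[href="..."]` attribute selector used for clicking below
    .map(a => a.getAttribute('href'));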
})
.then(function(hrefs){
  //here, there are two options:
  //  1. you could navigate to each link, get the information you need, then navigate back, or
  //  2. you could navigate straight to each link and get the information you need.
  //I'm going to go with #1 as that's how it was in your original script.

  //here, we're going to use the vanilla JS way of executing a series of promises in a sequence.
  //for every href in hrefs,
  return hrefs.reduce(function(accumulator, href){
    //return the accumulated promise results, followed by...
    return accumulator.then(function(results){
      return nightmare
        //click on the href
        .click('a[href="'+href+'"]')
        //get the html
        .evaluate(function(){
          return document.querySelector('html').innerHTML;
        })
        //add the result to the results
        .then(function(html){
          results.push(html);
          return results;
        })
        .then(function(results){
          //click on the search result link to go back to the search result page
          return nightmare
            .click('a[id="propertyHeading_searchResults"]')
            .then(function() {
              //make sure the results are returned
              return results;
            });
        })
    });
  }, Promise.resolve([])) //kick off the reduce with a promise that resolves an empty array
})
.then(function (resultArr) {
  //if I haven't made a mistake above with the `Array.reduce`, `resultArr` should now contain all of your links' results
  console.log('resultArr', resultArr);
  x(resultArr[1], 'body@html') //output the second entry's html (resultArr now holds one html string per entry)
    .write('results.json');
});

Hopefully that's enough to get you started.
