Moving between pages and scraping as I go with Nightmare
There is a website that contains a page with a list of 25 entries, where each entry is a link to a page containing some information that I need. I want to get to the listing page and then:
1) click on the link to the first entry
2) retrieve all the HTML
3) click back to the listing page (there is a button for this)
4) repeat for every other listing
I would also like to do this as efficiently as possible, which I've been told means leveraging promises. Here's my code sketch, which doesn't work:
var Nightmare = require('nightmare');
var nightmare = Nightmare({ openDevTools: true, show: true })
var Xray = require('x-ray');
var x = Xray();
var resultArr = [];
nightmare
  .goto(hidTestURL)
  .wait(2500)
  .click('input[name="propertySearchOptions:advanced"]') //start navigating to listing page
  .wait(2500)
  .type('input[name="propertySearchOptions:streetName"]', 'Main')
  .wait(2500)
  .select('select[name="propertySearchOptions:recordsPerPage"]', '25')
  .wait(2500)
  .click('input[name="propertySearchOptions:search"]') //at listing page
  .wait(2500)
  .then(function(){
    nightmare
      .click('a[href^="Property.aspx?prop_id=228645"]') //first entry
      .evaluate(function(){ //retrieve info
        var resultArr = [];
        resultArr.push(document.querySelector('html').innerHTML);
      })
  })

nightmare
  .click('a[id="propertyHeading_searchResults"]') //return to listing page
  .evaluate(function(){
    return resultArr.push(document.querySelector('html').innerHTML); //retrieve listing page info to show that it returned.
  })
  .then(function (resultArr) {
    console.log('resultArr', resultArr);
    x(resultArr[1], 'body@html') //output listing page html
      .write('results.json');
  })
This gets as far as the listing page, and then does not proceed any further.
I also tried the same code, but with return nightmare for every use of nightmare except the first one. I'd seen some examples that used return, but when I did this, the code threw an error.
I also tried not including the third nightmare (the one after the blank space), and instead trying to continue the old nightmare instance by going straight to the .click(), but this also threw an error.
I clearly need some help with the syntax and semantics of Nightmare, but there is not much documentation online besides an API listing. Does anyone know how I can make this work?
First, calling Nightmare the way you have it, broken into two chains, is probably not going to do what you want. (This comment thread is a good, albeit long, primer.) If memory serves, actions from the second chain will be queued immediately after the first, resulting in (probably) undesirable behavior. You said you had it written slightly differently. I'd be curious to see it; it sounds like it may have been a little closer.
Second, you're trying to lift resultArr in .evaluate(), which isn't possible. The function passed to .evaluate() is stringified and reconstituted inside of Electron, meaning that you'll lose the ambient context around the function. This example in nightmare-examples goes into a little more depth, if you're curious.
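To make the second point concrete, here is a rough sketch that imitates the stringify-and-rebuild step in plain Node, with no Nightmare or Electron involved. The simulateEvaluate helper is a hypothetical stand-in I made up for illustration, not Nightmare's actual implementation; the real .evaluate() also serializes its extra arguments, which this sketch skips.

```javascript
// Hypothetical stand-in for what happens to a function handed to .evaluate():
// only its source text survives, so it is rebuilt without its original closure.
function simulateEvaluate(fn, ...args) {
  const rebuilt = new Function('return (' + fn.toString() + ')')();
  return rebuilt(...args);
}

function demo() {
  const resultArr = ['outer value'];
  let closureError = null;

  try {
    // Relies on the enclosing scope, which the rebuilt function no longer has.
    simulateEvaluate(function () { return resultArr.length; });
  } catch (err) {
    closureError = err.name;
  }

  // Passing the value in as an argument works, because nothing is closed over.
  const passed = simulateEvaluate(function (arr) { return arr.length; }, resultArr);

  return { closureError: closureError, passed: passed };
}

const outcome = demo();
console.log(outcome);
```

The practical upshot is to return values out of .evaluate() and collect them in a later .then(), or to pass anything the function needs as extra arguments after the function itself.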
Third, and maybe this is a typo or me misunderstanding intent: your href selector uses the starts-with (^=) operator. Is that intentional? Should that be an ends-with ($=)?
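If it helps to see the difference between the operators, the sketch below mirrors them with plain string checks. The href value is hypothetical (I added a trailing year parameter purely so the three checks disagree), and the string methods are only an analogy for how the selectors compare the attribute's value.

```javascript
// Hypothetical href value, only for illustration; the extra parameter is invented.
const href = 'Property.aspx?prop_id=228645&year=2016';

// Each CSS attribute operator behaves like one of these checks on the attribute's value.
const matches = {
  startsWith: href.startsWith('Property.aspx?prop_id=228645'), // a[href^="..."]
  endsWith: href.endsWith('prop_id=228645'),                   // a[href$="..."]
  contains: href.includes('prop_id=228645')                    // a[href*="..."]
};

console.log(matches);
```

Which operator is right depends on what the links on your page actually look like, so it is worth inspecting one in dev tools before settling on a selector.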
Fourth, looping over asynchronous operations is tricky. I get the impression that may also be a stumbling block?
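As a small illustration of why the loop is tricky, the sketch below compares starting every asynchronous task at once with chaining them one after another. The visit function is a hypothetical stand-in for a page visit (it just waits 10 ms), so this runs without Nightmare.

```javascript
// Hypothetical stand-in for one asynchronous page visit; it only records when it starts and ends.
function visit(name, log) {
  log.push('start ' + name);
  return new Promise(function (resolve) {
    setTimeout(function () {
      log.push('end ' + name);
      resolve(name);
    }, 10);
  });
}

// Mapping over the list starts every visit immediately, so the visits overlap.
function visitAllAtOnce(names, log) {
  return Promise.all(names.map(function (name) { return visit(name, log); }));
}

// Reducing over the list chains each visit onto the previous one, so they run one at a time.
function visitInSequence(names, log) {
  return names.reduce(function (chain, name) {
    return chain.then(function (results) {
      return visit(name, log).then(function (result) {
        return results.concat(result);
      });
    });
  }, Promise.resolve([]));
}

const log = [];
visitInSequence(['first', 'second'], log).then(function (results) {
  console.log(results, log);
});
```

With a single browser instance you generally want the sequential form, since each step depends on the page the previous step left you on.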
With all of that in mind, let's take a look at modifying your original script. Admittedly untested, as I don't have access to your testing URL, so this is a bit from the hip:
var Nightmare = require('nightmare');
var nightmare = Nightmare({ openDevTools: true, show: true });
var Xray = require('x-ray');
var x = Xray();

nightmare
  .goto(hidTestURL)
  .wait(2500)
  .click('input[name="propertySearchOptions:advanced"]') //start navigating to listing page
  .wait(2500)
  .type('input[name="propertySearchOptions:streetName"]', 'Main')
  .wait(2500)
  .select('select[name="propertySearchOptions:recordsPerPage"]', '25')
  .wait(2500)
  .click('input[name="propertySearchOptions:search"]') //at listing page
  .wait(2500)
  .evaluate(function(){
    //using `Array.from` as the NodeList is not an array, but an array-like, sort of like `arguments`
    //planning on using `Array.map()` in a moment
    return Array.from(
      //give me all of the elements where the href contains 'Property.aspx'
      document.querySelectorAll('a[href*="Property.aspx"]'))
      //pull the raw href attributes for those anchors
      //(`a.href` would be the resolved absolute URL, which the attribute selector used for the click below may not match)
      .map(a => a.getAttribute('href'));
  })
  .then(function(hrefs){
    //here, there are two options:
    //  1. you could navigate to each link, get the information you need, then navigate back, or
    //  2. you could navigate straight to each link and get the information you need.
    //I'm going to go with #1 as that's how it was in your original script.

    //here, we're going to use the vanilla JS way of executing a series of promises in a sequence.
    //for every href in hrefs,
    return hrefs.reduce(function(accumulator, href){
      //return the accumulated promise results, followed by...
      return accumulator.then(function(results){
        return nightmare
          //click on the href
          .click('a[href="'+href+'"]')
          //give the entry page time to load (an assumption, mirroring the waits above)
          .wait(2500)
          //get the html
          .evaluate(function(){
            return document.querySelector('html').innerHTML;
          })
          //add the result to the results
          .then(function(html){
            results.push(html);
            return results;
          })
          .then(function(results){
            //click on the search result link to go back to the search result page
            return nightmare
              .click('a[id="propertyHeading_searchResults"]')
              //give the listing page time to load again (same assumption as above)
              .wait(2500)
              .then(function() {
                //make sure the results are returned
                return results;
              });
          });
      });
    }, Promise.resolve([])); //kick off the reduce with a promise that resolves an empty array
  })
  .then(function (resultArr) {
    //if I haven't made a mistake above with the `Array.reduce`, `resultArr` should now contain all of your links' results
    console.log('resultArr', resultArr);
    x(resultArr[1], 'body@html') //write out the body html of the second entry
      .write('results.json');
  });
Hopefully that's enough to get you started.
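For completeness, option 2 from the comments in the script (going straight to each link instead of clicking through and back) might look roughly like the sketch below. It is untested against a real site, and scrapeEach is a name I made up. It assumes the browser argument is a Nightmare instance, that the hrefs are absolute URLs (for example collected with a.href rather than the raw attribute), and that awaiting a Nightmare chain resolves with the value returned from .evaluate().

```javascript
// Hypothetical helper: visits each URL in turn and collects the page HTML.
// `browser` is assumed to be a Nightmare instance; `hrefs` are assumed to be absolute URLs.
async function scrapeEach(browser, hrefs) {
  const pages = [];
  for (const href of hrefs) {
    // One page at a time: the next goto() is not issued until this chain resolves.
    const html = await browser
      .goto(href)
      .evaluate(function () {
        // Runs inside the page, so it must not reference anything from the outer scope.
        return document.documentElement.outerHTML;
      });
    pages.push(html);
  }
  return pages;
}

// Assumed usage, inside the .then(function(hrefs){ ... }) of the listing page:
//   return scrapeEach(nightmare, hrefs);
```

This skips the back button entirely, which means one fewer selector to get wrong, at the cost of not exercising the site the way a user would.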