How do I parse multiple pages?
I have been attempting to parse a site's table data into a JSON file, which I can do if I do each page one by one, but seeing as there are 415 pages that would take a while.
I have seen and read a lot of StackOverflow questions on this subject, but I don't seem able to modify my script accordingly.
I believe this should be possible using request-promise and Promise.all, but I cannot figure it out.
The actual scraping of the data is fine; I just cannot make the code scrape a page and then go on to the next URL with a delay or pause between requests. The code below is the closest I have got, but I get the same results multiple times and I cannot slow the request rate down.
Example of the page URLs: http://test.com/itemlist/3 etc. (up to 415)
for (var i = 1; i <= noPages; i++) {
    urls.push({url: itemURL + i});
    console.log(itemURL + i);
}

Promise.map(urls, function(obj) {
    return rp(obj).then(function(body) {
        var $ = cheerio.load(body);
        //Some calculations again...
        rows = $('table tbody tr');
        $(rows).each(function(index, row) {
            var children = $(row).children();
            var itemName = children.eq(1).text().trim();
            var itemID = children.eq(2).text().trim();
            var itemObj = {
                "id" : itemID,
                "name" : itemName
            };
            itemArray.push(itemObj);
        });
        return itemArray;
    });
}, {concurrency : 1}).then(function(results) {
    console.log(results);
    for (var i = 0; i < results.length; i++) {
        // access the result's body via results[i]
        //console.log(results[i]);
    }
}, function(err) {
    // handle all your errors here
    console.log(err);
});
Apologies if I have misunderstood node.js and its modules; I don't really use the language, but I needed to scrape some data and I really don't like Python.
Since you need the requests to run one by one, Promise.all() would not help. A recursive promise (I'm not sure if that's the correct naming) would.
function fetchAllPages(list) {
    if (!list || !list.length) return Promise.resolve(); // trivial exit
    var urlToFetch = list.pop();
    return fetchPage(urlToFetch)
        .then(<wrapper that returns a Promise resolved after a delay>)
        .then(function() {
            return fetchAllPages(list); // recursion!
        });
}
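For illustration, here is a filled-in, runnable version of that recursion. `fetchPage` is a stub standing in for the real request, the 50 ms delay is arbitrary, and the accumulator parameter is my addition so the function can return the collected pages; treat it as a sketch, not the answer's exact code.

```javascript
// Stub standing in for the real page fetch; swap in your rp()/request call.
function fetchPage(url) {
  return Promise.resolve('body of ' + url);
}

// The <wrapper> from above: a promise that resolves after ms milliseconds.
function delay(ms) {
  return new Promise(function (resolve) {
    setTimeout(resolve, ms);
  });
}

function fetchAllPages(list, results) {
  results = results || [];
  if (!list || !list.length) return Promise.resolve(results); // trivial exit
  var urlToFetch = list.pop(); // note: pop() walks the list back to front
  return fetchPage(urlToFetch)
    .then(function (body) {
      results.push(body);
      return delay(50); // pause before the next request
    })
    .then(function () {
      return fetchAllPages(list, results); // recursion!
    });
}

fetchAllPages(['http://test.com/itemlist/1', 'http://test.com/itemlist/2'])
  .then(function (pages) {
    console.log(pages.length); // 2
  });
```

Because each recursive call waits for the previous fetch and delay to settle, the requests are strictly sequential.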
This code still lacks error handling. I also believe it can become much clearer with async/await:
for(let url of urls) {
await fetchAndProcess(url);
await <wrapper around setTimeout>;
}
but you need to find/write your own implementations of fetch() and setTimeout() that are async.
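A runnable sketch of that loop, with hypothetical stand-ins for fetchAndProcess and the setTimeout wrapper (neither implementation is from the original answer):

```javascript
// Promise wrapper around setTimeout, awaitable inside the loop.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for the real fetch-and-scrape step.
async function fetchAndProcess(url) {
  return 'processed ' + url;
}

async function processAll(urls) {
  const results = [];
  for (const url of urls) {
    results.push(await fetchAndProcess(url)); // waits for this page to finish
    await sleep(100);                         // then pauses before the next one
  }
  return results;
}

processAll(['http://test.com/itemlist/1', 'http://test.com/itemlist/2'])
  .then((results) => console.log(results));
```

Each `await` suspends the loop body, so only one request is in flight at a time and the pause between requests is guaranteed.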
After input from @skyboyer suggesting recursive promises, I was led to a GitHub Gist called "Sequential execution of Promises using reduce()".
First, I created my array of URLs:
for (var i = 1; i <= noPages; i++) {
//example urls[0] = "http://test.com/1"
//example urls[1] = "http://test.com/2"
urls.push(itemURL + i);
console.log(itemURL + i);
}
Then
var sequencePromise = urls.reduce(function(promise, url) {
    return promise.then(function(results) {
        //fetchIDsFromURL is async (it returns a promise in this case)
        //when the promise resolves I have my page data
        return fetchIDsFromURL(url)
            //the delay must be wrapped in a function; passing a bare promise to
            //.then would be ignored and the chain would continue immediately
            .then(itemArr => promiseWithDelay(9000).then(() => itemArr))
            .then(itemArr => {
                results.push(itemArr);
                //calling return inside the .then method makes sure the data is
                //passed on to the next iteration
                return results;
            });
    });
}, Promise.resolve([]));
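To see the reduce() chaining in isolation, here is a self-contained sketch with a stubbed per-page fetch and a short delay; the names and the fake data are illustrative, not from the real scraper.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Stub for the per-page scrape; the real one would hit the network.
const fetchStub = (url) => Promise.resolve([url + '#id1', url + '#id2']);

const pages = ['p1', 'p2', 'p3'];

// reduce() folds the URLs into one promise chain: each iteration waits for
// the previous page, fetches, pauses, and pushes onto the shared accumulator.
const sequence = pages.reduce(function (promise, url) {
  return promise.then(function (results) {
    return fetchStub(url)
      .then((itemArr) => sleep(20).then(() => itemArr)) // keep page data across the delay
      .then((itemArr) => {
        results.push(itemArr);
        return results; // pass the accumulator to the next iteration
      });
  });
}, Promise.resolve([])); // seed: a resolved promise carrying an empty array

sequence.then((results) => console.log(results.length)); // 3
```

The seed value `Promise.resolve([])` is what makes `results` available in the first iteration; every later iteration receives the same array via the `return results`.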
// async
function fetchIDsFromURL(url)
{
    return new Promise(function(resolve, reject){
        request(url, function(err, res, body){
            if (err) return reject(err);
            var $ = cheerio.load(body);
            var rows = $('table tbody tr');
            $(rows).each(function(index, row) {
                var children = $(row).children();
                var itemName = children.eq(1).text().trim();
                var itemID = children.eq(2).text().trim();
                var itemObj = {
                    "id" : itemID,
                    "name" : itemName
                };
                //push the 50 per-page scraped items into an array and resolve
                //with the array to send the data back from the promise
                itemArray.push(itemObj);
            });
            resolve(itemArray);
        });
    });
}
//returns a promise that resolves after the timeout
//(no clearTimeout needed: the timer fires exactly once)
function promiseWithDelay(ms)
{
    return new Promise(function(resolve){
        setTimeout(resolve, ms);
    });
}
Finally, call .then on the sequence of promises. The only issue I had was that results contained multiple arrays with the same data in each, so since all the data is the same in every array, I just take the first one, which has all my parsed items with IDs in it, and write it to a JSON file.
sequencePromise.then(function(results){
    console.log(results[0]);
    writeToFile(results[0]);
});