How to continuously scrape data from the DOM of a page that comes with infinite scrolling?
There is a web page that I want to scrape some information from.
I start by collecting a bunch of HTML elements.
var theSearch = document.getElementsByClassName('theID');
Then I turn that HTML collection into an array.
var arr = Array.prototype.slice.call( theSearch );
Now comes the tricky part.
I want to scroll down the page and scrape the new items that appear on it.
window.scrollTo(0, document.body.scrollHeight);
How do I access the newly inserted DOM nodes? Something like...
var theSearch2 = document.getElementsByClassName('theID');
...and convert it into a new array...
var arr2 = Array.prototype.slice.call( theSearch2 );
...and push the items from arr2 into arr, something like...
arr.push(...arr2);
And how do I implement a continuous process that keeps scraping until no new items are appended to the page's DOM?
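The OP might want to have a look at MutationObserver. Whenever new items get rendered into the DOM (triggered by scrolling), the observer's callback receives a list of MutationRecord instances which the OP can operate on.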
function handleChildlistChanges(mutationList/*, observer*/) {
  mutationList.forEach(mutation => {
    const { type, addedNodes } = mutation;
    if (type === 'childList') {
      // one or more children have been added to
      // and/or removed from the tree.
      scrapedContentNodes.push(...addedNodes);
      console.log({ scrapedContentNodes });
    }
  });
}
const scrapedContentNodes = [];

const options = {
  //attributes: true,
  childList: true,
  //subtree: true,
};
const target = document.querySelector('#items');

const observer = new MutationObserver(handleChildlistChanges);
observer.observe(target, options);

// test case... creating content.
['the quick', 'brown fox', 'jumped over', 'the lazy dog.'].reduce((parentNode, content, idx) => {
  const contentNode = document.createElement('p');
  contentNode.appendChild(
    document.createTextNode(content)
  );
  setTimeout(
    () => parentNode.appendChild(contentNode),
    600 * idx,
  );
  return parentNode;
}, target);
.as-console-wrapper { left: auto !important; width: 70%; min-height: 100%; }
<div id="items"> </div>
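The snippet above collects nodes but does not yet cover the "keep going until nothing new arrives" part of the question. A minimal sketch of one way to do that follows, under these assumptions: the page loads more items when the window is scrolled to the bottom (as in the question), new items are appended as direct children of the observed target, and a quiet period of `idleMs` milliseconds means the list is exhausted. The name `scrapeUntilIdle` and its `idleMs`, `scroll` and `Observer` options are hypothetical and exist only for illustration; the last two are injectable so the logic can be exercised outside a browser.

```javascript
// Hypothetical helper (not part of any library): observes `target`,
// scrolls to trigger loading, and resolves with all added nodes once
// no new nodes have arrived for `idleMs` milliseconds.
function scrapeUntilIdle(target, {
  idleMs = 3000, // assumed quiet period; tune it for the page in question
  scroll = () => window.scrollTo(0, document.body.scrollHeight),
  Observer = globalThis.MutationObserver,
} = {}) {
  return new Promise((resolve) => {
    const collected = [];
    let timer;

    // Called when the idle timer expires: stop observing and hand back the nodes.
    const finish = () => {
      observer.disconnect();
      resolve(collected);
    };

    // Restart the idle countdown, then trigger the next batch of items.
    const scrollAndWait = () => {
      clearTimeout(timer);
      timer = setTimeout(finish, idleMs);
      scroll();
    };

    const observer = new Observer((mutationList) => {
      let added = 0;
      for (const { type, addedNodes } of mutationList) {
        if (type === 'childList') {
          collected.push(...addedNodes);
          added += addedNodes.length;
        }
      }
      // Only keep scrolling when the last scroll actually produced new nodes.
      if (added > 0) scrollAndWait();
    });

    observer.observe(target, { childList: true });
    scrollAndWait();
  });
}
```

In a browser this could be used as `scrapeUntilIdle(document.querySelector('#items')).then(nodes => console.log(nodes.length))`. The idle timeout is a heuristic: a page that responds more slowly than `idleMs` would be cut off early, so the value likely needs adjusting per site.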
The MutationObserver interface provides the ability to watch for changes being made to the DOM tree.
var observer = new MutationObserver(function (mutations) {
  mutations.forEach(function (mutation) {
    mutation.addedNodes.forEach(function (addedNode) {
      console.log(addedNode, "@@@"); // your new item
    });
  });
});
observer.observe(document.getElementById("lists"), {
  childList: true,
  subtree: false
});
window.addEventListener('load', function() {
  var count = 0;
  var myInterval; // declared explicitly to avoid an implicit global

  // test case: appends a new list item with a random digit
  function addListItem() {
    console.log("called");
    const ul = document.getElementById("lists");
    var li = document.createElement("li");
    li.setAttribute("class", "item");
    ul.appendChild(li);
    li.innerHTML = li.innerHTML + Math.floor(Math.random() * 10);
    count++;
    if (count > 5) {
      myStopFunction();
    }
  }
  myInterval = setInterval(addListItem, 2000);

  function myStopFunction() {
    clearInterval(myInterval);
  }

  // HERE IS THE SOLUTION
  var observer = new MutationObserver(function (mutations) {
    mutations.forEach(function (mutation) {
      mutation.addedNodes.forEach(function (addedNode) {
        console.log(addedNode, "@@@"); // your new item
      });
    });
  });
  observer.observe(document.getElementById("lists"), {
    childList: true,
    subtree: false
  });
});
<!DOCTYPE html>
<html>
  <head>
    <title>Parcel Sandbox</title>
    <meta charset="UTF-8" />
  </head>
  <body>
    <div class="list-container">
      <ul id="lists">
        <li class="list-item">Rand</li>
      </ul>
    </div>
  </body>
</html>
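The callback above logs every added node, which can include text or comment nodes as well as elements the scraper does not care about. The following sketch shows one possible way to reduce a list of MutationRecord objects to the text of the wanted items only. The function name `extractAddedItems` is hypothetical, and the class name to filter on (for example `item` in the demo above, or `theID` in the question) is an assumption about the page's markup.

```javascript
// Hypothetical helper: returns the trimmed text of every newly added
// element that carries `className`, ignoring all other added nodes.
function extractAddedItems(mutationList, className) {
  const items = [];
  for (const { type, addedNodes } of mutationList) {
    // Only childList records describe added or removed nodes.
    if (type !== 'childList') continue;
    for (const node of addedNodes) {
      // Skip text and comment nodes; only element nodes (nodeType 1) have a classList.
      if (node.nodeType !== 1) continue;
      if (node.classList.contains(className)) {
        items.push(node.textContent.trim());
      }
    }
  }
  return items;
}
```

Inside an observer callback this might be called as `arr.push(...extractAddedItems(mutations, 'item'))`, which keeps one growing array of scraped values instead of merging repeated `getElementsByClassName` results.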