
How to continuously scrape data from the DOM of a page that comes with infinite scrolling?

There is a web page I want to scrape some information from.

I start by collecting a bunch of HTML elements.

var theSearch = document.getElementsByClassName('theID');

Then I turn that HTML collection into an array.

var arr = Array.prototype.slice.call( theSearch );

Now comes the tricky part.

I want to scroll down the page and scrape the new items that appear on it.

window.scrollTo(0, document.body.scrollHeight);

How do I access the newly inserted DOM nodes? Something like...

var theSearch2 = document.getElementsByClassName('theID');

...and convert it into a new array...

var arr2 = Array.prototype.slice.call( theSearch2 );

...and push the items from arr2 into arr, or something along those lines...

arr.push(...arr2);

And how do I implement a continuous process that keeps scraping until no more new items are appended to the page's DOM?
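The loop described in the question could be sketched roughly as follows: re-query the page, merge the results, scroll, wait, and stop once several rounds in a row bring nothing new. The helper name `scrapeUntilIdle` and its options (`collect`, `scrollDown`, `waitMs`, `maxIdleRounds`, `sleep`) are hypothetical and not part of the original post, and the one-second wait is only a guess at how long the page needs to render a batch. A `Set` is used instead of `arr.push(...arr2)` because each new query also returns the elements that were already collected.

```javascript
// Hypothetical helper, not from the original post. All DOM access is passed
// in as callbacks, so the loop itself has no browser dependency.
async function scrapeUntilIdle({
  collect,            // returns the items currently in the page
  scrollDown,         // triggers loading of the next batch
  waitMs = 1000,      // assumed time the page needs to render new items
  maxIdleRounds = 3,  // stop after this many rounds without new items
  sleep = ms => new Promise(resolve => setTimeout(resolve, ms)),
}) {
  const seen = new Set(); // a Set ignores items that were already collected
  let idleRounds = 0;

  while (idleRounds < maxIdleRounds) {
    const sizeBefore = seen.size;
    for (const item of collect()) seen.add(item);
    idleRounds = seen.size === sizeBefore ? idleRounds + 1 : 0;

    scrollDown();
    await sleep(waitMs);
  }
  return [...seen];
}

// In a browser this might be called as:
// scrapeUntilIdle({
//   collect: () => document.getElementsByClassName('theID'),
//   scrollDown: () => window.scrollTo(0, document.body.scrollHeight),
// }).then(items => console.log(items.length));
```

Polling like this is simple but depends on picking a wait time that suits the page; if batches load more slowly than `waitMs * maxIdleRounds`, the loop stops too early.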

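The OP might want to have a look at MutationObserver. Whenever new items are rendered into the DOM (triggered by scrolling), the observer's callback receives a list of MutationRecord instances which the OP can operate on.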

    function handleChildlistChanges(mutationList/*, observer*/) {
      mutationList.forEach(mutation => {
        const { type, addedNodes } = mutation;
        if (type === 'childList') {
          // one or more children have been added to
          // and/or removed from the tree.
          scrapedContentNodes.push(...addedNodes);
          console.log({ scrapedContentNodes });
        }
      });
    }
    const scrapedContentNodes = [];

    const options = {
      //attributes: true,
      childList: true,
      //subtree: true,
    };
    const target = document.querySelector('#items');

    const observer = new MutationObserver(handleChildlistChanges);
    observer.observe(target, options);

    // test case... creating content.
    ['the quick', 'brown fox', 'jumped over', 'the lazy dog.']
      .reduce((parentNode, content, idx) => {
        const contentNode = document.createElement('p');
        contentNode.appendChild(
          document.createTextNode(content)
        );
        setTimeout(
          () => parentNode.appendChild(contentNode),
          600 * idx,
        );
        return parentNode;
      }, target);
    .as-console-wrapper { left: auto!important; width: 70%; min-height: 100%; }
 <div id="items"> </div>
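To turn the observer into the "keep going until nothing new arrives" process the question asks about, one option is to pair it with an idle timer: every batch of mutations restarts the timer and triggers another scroll, and when the timer finally fires the observer is disconnected. The sketch below is only an illustration under assumptions; `collectAddedNodesUntilIdle`, `idleMs`, `scrollDown`, and the `Observer` option are hypothetical names, and the 3-second default is an arbitrary guess. The `Observer` option exists only so the function can be exercised outside a browser.

```javascript
// Hypothetical helper, not from the original answer.
function collectAddedNodesUntilIdle(target, options = {}) {
  const {
    idleMs = 3000,                          // assumed "nothing more is coming" threshold
    scrollDown = () => {},                  // e.g. scroll the window to the bottom
    Observer = globalThis.MutationObserver, // injectable for use outside a browser
  } = options;

  return new Promise(resolve => {
    const collected = [];
    let idleTimer;

    const finish = () => {
      observer.disconnect();
      resolve(collected);
    };
    const restartIdleTimer = () => {
      clearTimeout(idleTimer);
      idleTimer = setTimeout(finish, idleMs);
    };

    const observer = new Observer(mutationList => {
      for (const { type, addedNodes } of mutationList) {
        if (type === 'childList') collected.push(...addedNodes);
      }
      restartIdleTimer();
      scrollDown(); // ask the page for the next batch
    });

    observer.observe(target, { childList: true });
    restartIdleTimer();
    scrollDown(); // initial scroll to start loading
  });
}

// In a browser this might be called as:
// collectAddedNodesUntilIdle(document.querySelector('#items'), {
//   scrollDown: () => window.scrollTo(0, document.body.scrollHeight),
// }).then(nodes => console.log(nodes.length));
```

Compared with re-querying the page on a fixed interval, this reacts as soon as nodes are inserted and only needs a timeout for deciding when to stop. If the page loads batches more slowly than `idleMs`, the promise resolves before everything has been collected.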

MutationObserver

The MutationObserver interface provides the ability to watch for changes being made to the DOM tree.

    var observer = new MutationObserver(function (mutations) {
      mutations.forEach(function (mutation) {
        mutation.addedNodes.forEach(function (addedNode) {
          console.log(addedNode, "@@@"); // your new item
        });
      });
    });

    observer.observe(document.getElementById("lists"), {
      childList: true,
      subtree: false
    });

Try this:

    window.addEventListener('load', function() {
      var count = 0;

      // simulate a page that keeps appending list items
      function addListItem() {
        console.log("called");
        const ul = document.getElementById("lists");
        var li = document.createElement("li");
        li.setAttribute("class", "item");
        ul.appendChild(li);
        li.innerHTML = li.innerHTML + Math.floor(Math.random() * 10);
        count++;
        if (count > 5) {
          myStopFunction();
        }
      }

      // declared with var so it does not leak as an implicit global
      var myInterval = setInterval(addListItem, 2000);

      function myStopFunction() {
        clearInterval(myInterval);
      }

      // HERE IS THE SOLUTION
      var observer = new MutationObserver(function (mutations) {
        mutations.forEach(function (mutation) {
          mutation.addedNodes.forEach(function (addedNode) {
            console.log(addedNode, "@@@"); // your new item
          });
        });
      });

      observer.observe(document.getElementById("lists"), {
        childList: true,
        subtree: false
      });
    })
    <!DOCTYPE html>
    <html>
      <head>
        <title>Parcel Sandbox</title>
        <meta charset="UTF-8" />
      </head>
      <body>
        <div class="list-container">
          <ul id="lists">
            <li class="list-item">Rand</li>
          </ul>
        </div>
      </body>
    </html>
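The observer callbacks above handle every added node, which on a real page can include text nodes or wrapper elements that merely contain the items of interest. A small filter can narrow the records down to elements with a given class. This is only a sketch: `extractItems` is a hypothetical helper, and the class name is whatever the target page uses (`'theID'` in the question, `'item'` in the demo above).

```javascript
// Hypothetical helper, not from the original answers.
// Returns the added elements that carry `className`, whether they were
// inserted directly or nested inside an inserted wrapper element.
function extractItems(mutationList, className) {
  const items = [];
  for (const mutation of mutationList) {
    for (const node of mutation.addedNodes) {
      if (node.nodeType !== 1) continue; // 1 === Node.ELEMENT_NODE, skips text nodes
      if (node.classList.contains(className)) items.push(node);
      items.push(...node.getElementsByClassName(className)); // matching descendants
    }
  }
  return items;
}

// In a browser this might be used as:
// new MutationObserver(mutations => {
//   console.log(extractItems(mutations, 'item'));
// }).observe(document.getElementById('lists'), { childList: true });
```

Checking descendants of each added node covers the case where the page inserts a whole container at once, without having to observe with `subtree: true`.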

