
How to scrape inside <div> list using puppeteer

I am looking for a way to efficiently scrape information formatted in the following way using puppeteer. Suppose I have a list of things on a website divided as such:

<div id="list">
  <div class="item" pos="0"> 
  <a href="www.somewebsite.com">
    <div class="nameToRetrieve"> Name 1 </div>
  </div>
  <div class="item" pos="1"> 
  <a href="www.somewebsite.com">
    <div class="nameToRetrieve"> Name 2 </div>
  </div>
  <div class="item" pos="2"> 
  <a href="www.somewebsite.com">
    <div class="nameToRetrieve"> Name 3 </div>
  </div>
</div>

How can I retrieve the names (Name 1, Name 2 and Name 3)?

I have tried fitting them into an object and then turning them into an array, but I am still unsure how to approach it.

const listOfStuff = document.getElementById('list').getElementsByClassName('itemResult')

This has little to do with the puppeteer API itself. In modern browsers (ES6) you can convert the collection to an array with Array.from and then map over it. Note that I assumed the class nameToRetrieve only appears on the elements you want to retrieve, so there is no need to look up the "list" element first.

var names = Array.from(document.getElementsByClassName("nameToRetrieve")).map(x => x.innerHTML);
console.log(names);
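The snippet above runs in the browser itself. To run the same DOM code from a puppeteer script, one option is to pass it to page.evaluate. The following is only a minimal sketch: it assumes puppeteer is installed, and the names collectNames, scrapeWithEvaluate and the url parameter are hypothetical. It also swaps innerHTML for textContent.trim() so the surrounding whitespace is dropped.

```javascript
// Collects the trimmed text of every .nameToRetrieve element.
// `doc` defaults to the page's document when the function runs in the browser;
// passing it explicitly makes the function usable outside a browser too.
function collectNames(doc = document) {
  return Array.from(doc.getElementsByClassName('nameToRetrieve')).map((el) =>
    el.textContent.trim()
  );
}

// Hypothetical helper: opens `url` and runs collectNames inside the page.
async function scrapeWithEvaluate(url) {
  const puppeteer = require('puppeteer'); // assumes puppeteer is installed
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // The function is serialized and executed in the page context,
    // where `document` exists, so no argument is passed here.
    return await page.evaluate(collectNames);
  } finally {
    await browser.close();
  }
}
```

Calling scrapeWithEvaluate with the address of the page that contains the list should resolve to an array of the names.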

There is a special convenience method page.$$eval for this task in puppeteer:

let result = await page.$$eval('.nameToRetrieve', names => names.map(name => name.textContent));
console.log(result);

This method runs Array.from(document.querySelectorAll(selector)) within the page and passes it as the first argument to pageFunction.

The result will be:

[ ' Name 1 ', ' Name 2 ', ' Name 3 ' ]
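Since the question mentions fitting the data into objects, the same page.$$eval call can also select the .item elements and return one object per item. This is a sketch under assumptions, not a definitive implementation: toItems, scrapeItems and the url parameter are hypothetical names, puppeteer is assumed to be installed, and the markup is assumed to match the structure shown in the question.

```javascript
// pageFunction for page.$$eval: receives the array of matched .item elements
// and returns plain, serializable objects.
function toItems(items) {
  return items.map((item) => {
    const nameEl = item.querySelector('.nameToRetrieve');
    const link = item.querySelector('a');
    return {
      pos: Number(item.getAttribute('pos')),
      // trim() removes the spaces around " Name 1 "
      name: nameEl ? nameEl.textContent.trim() : null,
      href: link ? link.getAttribute('href') : null,
    };
  });
}

// Hypothetical helper: opens `url` and extracts every item under #list.
async function scrapeItems(url) {
  const puppeteer = require('puppeteer'); // assumes puppeteer is installed
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    return await page.$$eval('#list .item', toItems);
  } finally {
    await browser.close();
  }
}
```

Returning plain objects matters here because the result of the page function has to be serialized back to Node; DOM elements themselves cannot be returned this way.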


 