
Scrape text from a complex DOM structure

Consider the following DOM hierarchy:

<div class="bodyCells">
    <div style="foo">
       <div style="foo">
           <div style="foo1"> 'contains the list of text elements I want to scrape' </div>
           <div style="foo2"> 'contains the list of text elements I want to scrape' </div>
       </div>
       <div style="foo">
           <div style="foo3"> 'contains the list of text elements I want to scrape' </div>
           <div style="foo4"> 'contains the list of text elements I want to scrape' </div>
       </div>
    </div>
</div>

Starting from the class name bodyCells, I need to scrape the text from each of the inner divs one at a time (i.e. first from the first div, then from the next div, and so on) and store each result in a separate array. How can I achieve this using Puppeteer?

NOTE: I have tried selecting by the class name directly, but that returns all of the text in a single array. I need the text from each div in its own array.

Expected output:

array1 = ['text present within the style="foo1" div tag']
array2 = ['text present within the style="foo2" div tag']
array3 = ['text present within the style="foo3" div tag']
array4 = ['text present within the style="foo4" div tag']

As you noted, selecting by the class name alone gives you all of the text in a single array. If you instead iterate over the nested child divs of each matched element, you can create a separate array for each inner div.

I created a fiddle here - https://jsfiddle.net/32bnoey6/ - with this example code:

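// Collect every element with the class name "bodyCells".
const cells = document.getElementsByClassName('bodyCells');

// Each entry is a one-element array holding the text of one target div.
const scrapedElements = [];
for (const cell of cells) {
  // The markup in the question nests three levels below .bodyCells:
  // wrapper div -> group div -> target div (foo1, foo2, ...).
  for (const wrapper of cell.children) {
    for (const group of wrapper.children) {
      for (const targetDiv of group.children) {
        // textContent returns only the text; innerHTML would also include nested markup.
        scrapedElements.push([targetDiv.textContent.trim()]);
      }
    }
  }
}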

console.log(scrapedElements);
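
Since the question asks about Puppeteer, here is a minimal sketch of how the same traversal might be run through page.$$eval, which passes the matched elements to a function executed inside the page. It assumes the three-level nesting shown in the question's markup and that the puppeteer package is installed; the names collectCellTexts and scrapeBodyCells, and the url parameter, are made up for illustration.

```javascript
// Hypothetical helper: runs inside the page via page.$$eval.
// `cells` is the array of elements matching '.bodyCells'.
// Assumes the nesting from the question: wrapper div -> group div -> target div.
function collectCellTexts(cells) {
  const arrays = [];
  for (const cell of cells) {
    for (const wrapper of cell.children) {
      for (const group of wrapper.children) {
        for (const targetDiv of group.children) {
          // One single-element array per target div.
          arrays.push([targetDiv.textContent.trim()]);
        }
      }
    }
  }
  return arrays;
}

// Hypothetical wrapper: opens `url` and returns one array per target div.
async function scrapeBodyCells(url) {
  // Loaded lazily so the helper above can be used without Puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // $$eval serializes collectCellTexts and runs it in the page context.
    return await page.$$eval('.bodyCells', collectCellTexts);
  } finally {
    await browser.close();
  }
}
```

If the page really contains four target divs, the result could then be unpacked with destructuring, for example const [array1, array2, array3, array4] = await scrapeBodyCells(url);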


 