Scrape text in SPAN array for div ID using Puppeteer

I have this HTML:

<div id="ctl00_ctl00_ctl00_cphMain_cphMiddle_cphCenterColumn_uctDiveInfoDisplay_TabContainer1_tabScubeCoursesOffered_ScubaCoursesViewDIV" class="modules-wrapper">
  <table>
    <tr>
      <td><div>  <span> -Master Scuba Diver </span> </div></td>
      <td><div>  <span> -Fish Identification </span> </div></td>
    </tr>
    <tr>
      <td><div>  <span> -Underwater Navigator </span> </div></td>
      <td><div>  <span> -EFR Primary Care with AED </span> </div></td>
    </tr>
    <tr>
      <td><div>  <span> -Search & Recovery Diver </span> </div></td>
      <td><div>  <span> -Deep Diver </span> </div></td>
    </tr>
    <tr>
      <td><div>  <span> -Wreck Diver </span> </div></td>
      <td><div>  <span> -Divemaster </span> </div></td>
    </tr>
    <tr>
      <td><div>  <span> -AWARE Coral Reef Conservation </span> </div></td>
      <td><div>  <span> -PADI Seal Team </span> </div></td>
    </tr>
    <tr>
      <td><div>  <span> -Bubblemaker </span> </div></td>
      <td><div>  <span> -Advanced Open Water Diver </span> </div></td>
    </tr>
    <tr>
      <td><div>  <span> -Peak Performance Buoyancy Diver </span> </div></td>
      <td><div>  <span> -Scuba Diver </span> </div></td>
    </tr>
    <tr>
      <td><div>  <span> -Rescue Diver </span> </div></td>
      <td><div>  <span> -Discover Scuba Diving </span> </div></td>
    </tr>
    <tr>
      <td><div>  <span> -PADI Master Seal Team </span> </div></td>
      <td><div>  <span> -Project Aware </span> </div></td>
    </tr>
    <tr>
      <td><div>  <span> -Open Water Diver </span> </div></td>
      <td><div>  <span> -Adventure Diver </span> </div></td>
    </tr>
    <tr>
      <td><div>  <span> -Skin Diver </span> </div></td>
    </tr>
  </table>
</div>

I want to get the text inside each <span> within the <div> that has that ID and return the texts as an array. How do I achieve this? I have tried everything...

The simplest one-line solution is to use page.$$eval to collect all <span> elements in the page context. Under the hood it runs Array.from(document.querySelectorAll(selector)) and passes the result to your callback. Because the callback receives an array of elements, you can use Array.prototype.map to get the innerText of each one.

const spanTexts = await page.$$eval('span', elements => elements.map(el => el.innerText))
console.log(spanTexts)

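To restrict the match to a specific <div>, prefix the selector with its #id. A universal selector (*) between the #id and span acts as a wildcard for any intermediate element, but it only matches spans nested at least two levels below the <div>. A plain descendant combinator (a space, as in '#id span') matches spans at any depth. MDN has a guide on writing CSS selectors.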

const spanTexts = await page.$$eval('#ctl00_ctl00_ctl00_cphMain_cphMiddle_cphCenterColumn_uctDiveInfoDisplay_TabContainer1_tabScubeCoursesOffered_ScubaCoursesViewDIV * span', elements => elements.map(el => el.innerText))
console.log(spanTexts)
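
Since the generated ID is very long, an attribute selector that matches only the end of the ID may be easier to read. The following is a minimal sketch, not part of the original answer: spanSelectorForIdSuffix is a hypothetical helper name, and it assumes that only one element on the page has an id ending in the given suffix.

```javascript
// Hypothetical helper (not part of Puppeteer): builds a CSS selector for all
// <span> descendants of the element whose id ends with `idSuffix`.
// Assumption: the suffix is unique among the ids on the page.
function spanSelectorForIdSuffix(idSuffix) {
  // Escape backslashes and double quotes so the suffix is safe inside "...".
  const escaped = idSuffix.replace(/\\/g, '\\\\').replace(/"/g, '\\"');
  return `[id$="${escaped}"] span`;
}

// Usage inside an async function, with `page` being a Puppeteer Page:
// const spanTexts = await page.$$eval(
//   spanSelectorForIdSuffix('ScubaCoursesViewDIV'),
//   elements => elements.map(el => el.innerText)
// );
```

The [id$="..."] form is a standard CSS attribute selector, so it can be passed to page.$$eval like any other selector.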

I found a solution: select the <div> by its ID, then get the text of the spans inside it.

const spanTexts = await page.$$eval('#ctl00_ctl00_ctl00_cphMain_cphMiddle_cphCenterColumn_uctDiveInfoDisplay_TabContainer1_tabScubeCoursesOffered_ScubaCoursesViewDIV span', elements => elements.map(el => el.innerText))

Thanks a lot @thedavidbarton
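
The spans in the question's HTML contain padding spaces and a leading dash, so the scraped strings may need tidying before use. Below is a small sketch of one way to do that in Node after page.$$eval returns. cleanCourseNames is a hypothetical name, and the sample array is copied from the HTML in the question rather than from a real scrape.

```javascript
// Hypothetical helper: normalizes the strings returned by page.$$eval.
function cleanCourseNames(texts) {
  return texts
    // Trim surrounding whitespace, then drop one leading "-" and any spaces after it.
    .map(text => text.trim().replace(/^-\s*/, ''))
    // Discard entries that end up empty.
    .filter(name => name.length > 0);
}

// Sample input modeled on the spans in the question.
const sample = [' -Master Scuba Diver ', ' -Fish Identification ', '   '];
console.log(cleanCourseNames(sample));
// → ['Master Scuba Diver', 'Fish Identification']
```

Keeping the cleanup outside the $$eval callback means it runs in Node rather than in the browser context, so it can be tested without launching a browser.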
