Scraping data between two classes

I'm using cheerio to get information from a specific website.

Website source code:

<body>
 <div class="row">
  <div class="container">
   <div class="row">
    <div id="content" class="col-xl-9 col-lg-8 col-12 p-4">
     <div class="row" id="box">
      <div class="col-12">...</div>
      <div class="col-xl-4">...</div>
      <div class="col-xl-4">...</div>
      <div class="col-12">...</div>
      <div class="col-xl-4">...</div>
     </div>
    </div>
   </div>
  </div>
 </div>
</body>

Now I want to count all ".col-xl-4" elements between the two ".col-12" elements.

My current approach counts every ".col-xl-4" element, including those after the second ".col-12" element:

console.log($('.row > .col-xl-4', html).get().length)

How can I count only the ones between the two ".col-12" elements?

Here's one scheme that iterates the .row > div elements and uses a state machine to go through these four states:

  1. Looking for first "col-12"
  2. Looking for first "col-xl-4" after that
  3. Counting successive "col-xl-4" items
  4. Done counting

Here's a sample implementation that runs in Node.js:

const cheerio = require('cheerio');

const sampleHTML = `
    <div class="row">
        <div class="col-12"></div>
        <div class="col-xl-4"></div>
        <div class="col-xl-4"></div>
        <div class="col-xl-4"></div>
        <div class="col-12"></div>
        <div class="col-12"></div>
        <div class="col-12"></div>
    </div>
`;

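const $ = cheerio.load(sampleHTML);
// On the real page, a narrower selector such as "#box > div" avoids
// also matching the children of the outer ".row" wrappers.
const divs = $(".row > div");
let state = "looking-col-12";
let cnt = 0;
divs.each((i, div) => {
    const item = $(div);
    switch (state) {
        // State 1: skip everything until the first "col-12"
        case "looking-col-12":
            if (item.hasClass("col-12")) {
                state = "lookingFirst-xl-4";
            }
            break;
        // State 2: wait for the first "col-xl-4" after that "col-12"
        case "lookingFirst-xl-4":
            if (item.hasClass("col-xl-4")) {
                state = "counting-xl-4";
                cnt = 1;
            }
            break;
        // State 3: count consecutive "col-xl-4" items; stop at the first other element
        case "counting-xl-4":
            if (item.hasClass("col-xl-4")) {
                ++cnt;
            } else {
                state = "done";
            }
            break;
        // State 4 ("done"): ignore all remaining elements
        default:
            break;
    }
});
console.log(cnt); // 3 for sampleHTML above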
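
As a rough alternative, the same count can be sketched without a state machine by finding the positions of the first two "col-12" siblings and counting the "col-xl-4" items between them. The helper name countBetweenMarkers and its parameters are hypothetical (not part of cheerio), and the input is assumed to be an array of class-name arrays, one per sibling in document order. Unlike the state machine above, this sketch returns 0 when there is no second "col-12", and it counts every "col-xl-4" between the two markers even if other elements sit in between.

```javascript
// Hypothetical helper (not part of cheerio): counts siblings carrying
// `targetClass` that sit between the first two siblings carrying `markerClass`.
// `classLists` is an array of class-name arrays, one per sibling, in document order.
function countBetweenMarkers(classLists, markerClass, targetClass) {
    // Index of the first marker element.
    const start = classLists.findIndex((classes) => classes.includes(markerClass));
    if (start === -1) {
        return 0; // no opening marker at all
    }
    // Index of the next marker after the first one.
    const end = classLists.findIndex(
        (classes, i) => i > start && classes.includes(markerClass)
    );
    if (end === -1) {
        return 0; // no closing marker, so nothing is "between" two markers
    }
    // Count the targets strictly between the two markers.
    return classLists
        .slice(start + 1, end)
        .filter((classes) => classes.includes(targetClass)).length;
}

// Class lists mirroring the children of #box in the question.
const boxChildren = [
    ["col-12"],
    ["col-xl-4"],
    ["col-xl-4"],
    ["col-12"],
    ["col-xl-4"],
];
console.log(countBetweenMarkers(boxChildren, "col-12", "col-xl-4")); // 2
```

With cheerio, the class lists could presumably be built with something like $("#box > div").get().map((el) => ($(el).attr("class") || "").split(/\s+/)). cheerio also provides .nextUntil(), so $("#box > .col-12").first().nextUntil(".col-12", ".col-xl-4").length may give the same count directly; neither expression has been tested against the real page.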

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address.