
Iterate through div elements

I'm a complete beginner in JavaScript and web scraping with Puppeteer, and I am trying to get the scores of a simple EuroLeague round at https://www.euroleague.net/main/results?gamenumber=28&phasetypecode=RS&seasoncode=E2019

[Image: screenshot of the score list]

By inspecting the score list above, I found that it is a div element containing other divs that display the stats.

HTML for a single match between 2 teams (there are more divs for matches below this example):

<!-- score list -->
<div class="wp-module wp-module-asidegames wp-module-5lfarqnjesnirthi">
  <!-- the data-code increases to "euro_245"... -->
  <div class="">
    <div class="game played" data-code="euro_244" data-date="1583427600000" data-played="1">
      <a href="/main/results/showgame?gamecode=244&amp;seasoncode=E2019" class="game-link">
        <div class="club">
          <span class="name">Zenit St Petersburg</span>
          <span class="score homepts winner">76</span>
        </div>
        <div class="club">
          <span class="name">Zalgiris Kaunas</span>
          <span class="score awaypts ">75</span>
        </div>
        <div class="info">
          <span class="date">March 5 18:00 CET</span>
          <span class="live"> LIVE <span class="minute"></span> </span>
          <span class="final"> FINAL </span>
        </div>
      </a>
    </div>
    <!-- more teams -->
  </div>
</div>

What I want is to iterate through the outer div element, get the teams playing and the score of each match, and store them in a JSON file. However, since I am a complete beginner, I do not understand how to iterate through the HTML above. This is my web scraping code to get the element:

const puppeteer = require('puppeteer');

const sleep = (delay) => new Promise((resolve) => setTimeout(resolve, delay));

async function getTeams(url) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url);
  await sleep(3000);
  const games = await page.$x('//*[@id="main-one"]/div/div/div/div[1]/div[1]/div[3]');
  // this is where I will execute the iteration part to get the matches with their scores
  await sleep(2000);
  await browser.close();
}

getTeams('https://www.euroleague.net/main/results?gamenumber=28&phasetypecode=RS&seasoncode=E2019');

I would appreciate your help with guiding me through the iteration part. Thank you in advance.

The most accurate selector for a game box is div.game.played (a div that has both the .game and the .played CSS classes). You will need to count the elements that match this selector. This is possible with page.$$eval ( page.$$eval(selector, pageFunction[, ...args]) ), which runs Array.from(document.querySelectorAll(selector)) within the page and passes the result as the first argument to pageFunction.
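To make the role of pageFunction more concrete, here is a minimal sketch of a function that could be passed to page.$$eval('div.game.played', ...). The name extractGames is hypothetical, and the class-based selectors (.name, .homepts, .awaypts, .date) are assumptions taken from the HTML in the question, so they may need adjusting if the page markup differs. It reads every field of every game in one pass inside the browser.

```javascript
// Hypothetical page function, meant to be used as:
//   const games = await page.$$eval('div.game.played', extractGames)
// Puppeteer serializes the function and runs it in the browser, so it must
// not reference variables from the surrounding Node.js scope.
function extractGames(gameElements) {
  return gameElements.map((game) => {
    // Trimmed text of the first descendant matching `selector`, or null if none.
    const text = (selector) => {
      const node = game.querySelector(selector);
      return node ? node.textContent.trim() : null;
    };
    return {
      code: game.getAttribute('data-code'),
      homeClub: text('.club:nth-child(1) .name'),
      awayClub: text('.club:nth-child(2) .name'),
      homePoints: text('.homepts'),
      awayPoints: text('.awaypts'),
      gameDate: text('.date'),
    };
  });
}
```

Because the return value is an array of plain objects, it can be passed back from the page to Node.js without further conversion.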

Because we use the element indexes to reach the specific data fields, we run a regular for loop over the number of matched elements.

If you need a specific range of "euro_xyz" codes, you can read the data-code attribute values inside page.evaluate with Element.getAttribute and compare their numeric part against the desired "xyz" number.
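The range check itself is plain string handling and can be tried without a browser. A small sketch, assuming every code follows the "euro_<number>" pattern seen in the question; gameNumber and filterByGameNumber are hypothetical helper names.

```javascript
// Turn a data-code such as "euro_244" into the number 244.
// Returns NaN when the value does not follow the "euro_<digits>" pattern.
function gameNumber(dataCode) {
  const match = /^euro_(\d+)$/.exec(dataCode || '');
  return match ? Number.parseInt(match[1], 10) : NaN;
}

// Keep only the codes whose number lies between min and max (both inclusive).
function filterByGameNumber(dataCodes, min, max) {
  return dataCodes.filter((code) => {
    const n = gameNumber(code);
    return n >= min && n <= max; // any comparison with NaN is false
  });
}
```

For instance, filterByGameNumber(codes, 244, 252) would keep only the games numbered 244 to 252, where codes is an array of data-code strings read from the page.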

To collect each game's data, we can define a collector array ( gameObj ) that grows with each iteration. In each iteration we fill an actualGame object with that game's data.

It is important to determine which child elements contain the corresponding data values. For example, the home club's name is 'div.game.played > a > div:nth-child(1) > span:nth-child(1)': the div child number selects the club, while the span child number chooses between the club name and the points. The loop's [i] index picks the values of the right game box (that is why the boxes were counted at the beginning).

For example:

// Run this inside an async function (for example getTeams), after page.goto(url).
const allGames = await page.$$('div.game.played')
const allGameLength = await page.$$eval('div.game.played', el => el.length)
const gameObj = [] // collector array for the matching games
for (let i = 0; i < allGameLength; i++) {
  try {
    // "euro_244" -> 244
    let dataCode = await page.evaluate(el => el.getAttribute('data-code'), allGames[i])
    dataCode = parseInt(dataCode.replace('euro_', ''), 10)

    // keep only the games whose number is above 243
    if (dataCode > 243) {
      const actualGame = {
        homeClub: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(1) > span:nth-child(1)'))[i]),
        awayClub: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(2) > span:nth-child(1)'))[i]),
        homePoints: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(1) > span:nth-child(2)'))[i]),
        awayPoints: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(2) > span:nth-child(2)'))[i]),
        gameDate: await page.evaluate(el => el.textContent, (await page.$$('div.game.played > a > div:nth-child(3) > span:nth-child(1)'))[i])
      }
      gameObj.push(actualGame)
    }
  } catch (e) {
    console.error(e)
  }
}

console.log(JSON.stringify(gameObj))
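Since the question asks for a JSON file rather than console output, the collected array could be written to disk with Node's built-in fs module. This is only a sketch: saveGames and the file name used below are hypothetical, and the indentation of 2 spaces is just a readability choice.

```javascript
const fs = require('fs');

// Write the collected games to `filePath` as pretty-printed JSON.
function saveGames(games, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(games, null, 2), 'utf8');
}
```

Calling saveGames(gameObj, 'games.json') after the loop would create games.json in the current working directory.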

Puppeteer has a page.waitFor method that serves the same purpose as your sleep function (note that it is deprecated in newer Puppeteer releases), but you can also wait for a selector to appear with page.waitForSelector.
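As a rough sketch of the selector-based wait, the fixed sleeps could be replaced by something like the helper below. waitForGames is a hypothetical name, page is assumed to be a Puppeteer Page, and the 10 second default timeout is an arbitrary choice.

```javascript
// Wait until at least one game box is in the DOM, then return how many there are.
// waitForSelector rejects if nothing matches within `timeout` milliseconds.
async function waitForGames(page, timeout = 10000) {
  await page.waitForSelector('div.game.played', { timeout });
  return page.$$eval('div.game.played', (elements) => elements.length);
}
```

Inside getTeams this would be called as const count = await waitForGames(page) right after page.goto(url).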


 