
JavaScript Node.js WebScraping: How do I find specific elements on webpage table to scrape and push into an array of objects?

I am trying to practice web scraping using a betting site for UFC fights. I am using JavaScript and the packages request-promise and cheerio.

Site: https://www.oddsshark.com/ufc/odds

I want to scrape the names of the fighters and their respective betting lines for each betting company.

(Image: sample screenshot of the website)

My goal is to end up with something like an array of objects that I can later use to seed a PostgreSQL database.

Example of my desired output (it doesn't have to be exactly like this, but similar):

[
  { fighter 1: 'Khabib Nurmagomedov', openingBetLine: -333, bovadaBetLine: -365, etc. },
  { fighter 2: 'Dustin Poirier', openingBetLine: 225, bovadaBetLine: 275, etc. },
  { fighter 3: etc.},
  { fighter 4: etc.}
]

Below is the code I have so far. I am a noob at this:

const rp = require("request-promise");
const url = "https://www.oddsshark.com/ufc/odds";


// cheerio to parse HTML
const $ = require("cheerio");

rp(url)
  .then(function(html) {
    // it worked :)

    // console.log("MMA page:", html);
    // console.log($("big > a", html).length);
    // console.log($("big > a", html));

    console.log($(".op-matchup-team-text", html).length);
    console.log($(".op-matchup-team-text", html));
  })
  // why isn't catch working?
  .catch(function(error) {
    // handle error
  });

My code above returns an object with indexes as keys and nested objects as values. Below is just one of them as an example.

{ '0':
   { type: 'tag',
     name: 'span',
     namespace: 'http://www.w3.org/1999/xhtml',
     attribs: [Object: null prototype] { class: 'op-matchup-team-text' },
     'x-attribsNamespace': [Object: null prototype] { class: undefined },
     'x-attribsPrefix': [Object: null prototype] { class: undefined },
     children: [ [Object] ],
     parent:
      { type: 'tag',
        name: 'div',
        namespace: 'http://www.w3.org/1999/xhtml',
        attribs: [Object],
        'x-attribsNamespace': [Object],
        'x-attribsPrefix': [Object],
        children: [Array],
        parent: [Object],
        prev: [Object],
        next: [Object] },
     prev: null,
     next: null },

I don't know what to do from here. Am I selecting the right class (op-matchup-team-text)? If so, how do I extract the fighter names and betting line elements from the website?

////////////////////////// UPDATE 1 ON ORIGINAL POST //////////////////////////

Updated: Using Henk's suggestion, I'm able to scrape the fighter names. Using the same code template, I was able to scrape the fighters' betting lines as well.

BUT I don't know how to get both onto one object. For example, how do I associate a betting line with the fighter it belongs to?

Below is my code for scraping the OPENING company's betting line:

rp(url)
  .then(function(html) {
    const $ = cheerio.load(html);

    const openingBettingLine = [];

    // parent class of fighter name
    $("div.op-item.op-spread.op-opening").each((index, currentDiv) => {
      const openingBet = {
        opening: JSON.parse(currentDiv.attribs["data-op-moneyline"]).fullgame
      };
      openingBettingLine.push(openingBet);
    });
    console.log("openingBettingLine array test 2:", openingBettingLine);
  })
  // why isn't catch working?
  // eslint-disable-next-line handle-callback-err
  .catch(function(error) {
    // handle error
  });

It logs the following to the console:

openingBettingLine array test 2: [ { opening: '-200' },
  { opening: '+170' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '+105' },
  { opening: '-135' },
  { opening: '-165' },
  { opening: '+135' },
  { opening: '-120' },
  { opening: '-110' },
  { opening: '-135' },
  { opening: '+105' },
  { opening: '-165' },
  { opening: '+135' },
  { opening: '-115' },
  { opening: '-115' },
  { opening: '-145' },
  { opening: '+115' },
  { opening: '+208' },
  { opening: '-263' },
  etc. 

My desired output is still the same (see the example below). So how would I get the openingBettingLine into the object associated with the fighter?

[
  { fighter 1: 'Khabib Nurmagomedov', openingBetLine: -333, bovadaBetLine: -365, etc. },
  { fighter 2: 'Dustin Poirier', openingBetLine: 225, bovadaBetLine: 275, etc. },
  { fighter 3: etc.},
  { fighter 4: etc.}
]

////////////////////////// UPDATE 2 ON ORIGINAL POST //////////////////////////

I can't get the BOVADA company's betting line to scrape. I isolated the code for just this company below.

// BOVADA betting line array --> not working

rp(url)
  .then(function(html) {
    const $ = cheerio.load(html);

    const bovadaBettingLine = [];

    // parent class of fighter name
    $("div.op-item.op-spread.border-bottom.op-bovada.lv").each(
      (index, currentDiv) => {
        const bovadaBet = {
          BOVADA: JSON.parse(currentDiv.attribs["data-op-moneyline"]).fullgame
        };
        bovadaBettingLine.push(bovadaBet);
      }
    );
    console.log("bovadaBettingLine:", bovadaBettingLine);
  })
  // why isn't catch working?
  // eslint-disable-next-line handle-callback-err
  .catch(function(error) {
    // handle error
  });

It returns bovadaBettingLine: [] with nothing in it.

Below is the HTML code for that part of the website.

(Image: screenshot of the HTML source; not available here)

Short:

  1. Select the right data with a suitable cheerio method
  2. Create your own object, and put your data in there

In Detail:

First, analyse the source code of the data you want:

<div class="op-matchup-team op-matchup-text op-team-top" data-op-name="{full_name:Jessica Andrade,short_name:}"><span class="op-matchup-team-text">Jessica Andrade</span></div>

You are trying to get the name of the fighter. So you could aim for the content of <span class="op-matchup-team-text">Jessica Andrade</span>, or for the attribute of the parent div, which is data-op-name="{full_name:Jessica Andrade,short_name:}"

Let's try the second one:

  1. Get all the divs with the desired content: $("div.op-matchup-team.op-matchup-text.op-team-top")
  2. Traverse the divs with cheerio's built-in each() iterator
  3. Within each iteration, create an object with all relevant fighter parameters and push it into a fighters array.

See also the code comments below:

const rp = require("request-promise");
const cheerio = require("cheerio");

const url = "https://www.oddsshark.com/ufc/odds";

rp(url)
  .then(function (html) {
    const $ = cheerio.load(html);

    // Select the opening-line column once instead of re-querying it on every iteration.
    const openingDivs = $("div.op-item.op-spread.op-opening");

    const fighters = [];
    $("div.op-matchup-team.op-matchup-text.op-team-top").each((index, currentDiv) => {
      const fighter = {
        name: JSON.parse(currentDiv.attribs["data-op-name"]).full_name,
        // There is no direct selector for the rows of the second column based on the first one.
        // So select all rows of the second column, as you did, and then use the current index
        // to get the right row. Put the selected data into your "basket", the fighter object. Done.
        openingBetLine: JSON.parse(openingDivs[index].attribs["data-op-moneyline"]).fullgame
        // Go on the same way with the other rows that you need.
      };

      fighters.push(fighter);
    });

    console.log(fighters);
  })
  .catch(function (error) {
    // The error catch does work; you just need to print it out to see it.
    console.log(error);
  });

This will give you:

[{ name: 'Jessica Andrade',
openingBetLine: '-200'},...]
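
The desired output in the question uses numbers (for example -333), while the scraped fullgame values are strings such as '-200', '+170' or ''. Here is a minimal sketch of one way to handle that, assuming the attribute holds JSON with a fullgame key as in the code above. The helper names parseJsonAttr and toMoneyline are hypothetical, not part of cheerio, and the sample strings only imitate the values logged in the question.

```javascript
// Hypothetical helper: parse a JSON-encoded attribute value. Returns
// `fallback` when the value is missing, not valid JSON, or not an object.
function parseJsonAttr(raw, fallback = {}) {
  if (typeof raw !== "string" || raw.trim() === "") return fallback;
  try {
    const parsed = JSON.parse(raw);
    return parsed !== null && typeof parsed === "object" ? parsed : fallback;
  } catch (err) {
    return fallback;
  }
}

// Hypothetical helper: turn a moneyline string such as "-200" or "+170"
// into a number. Empty or non-numeric values become null.
function toMoneyline(value) {
  if (typeof value !== "string" || value.trim() === "") return null;
  const n = Number(value);
  return Number.isFinite(n) ? n : null;
}

// Sample attribute values (assumed shape), including an empty line and a
// missing attribute.
const samples = [
  '{"fullgame":"-200"}',
  '{"fullgame":"+170"}',
  '{"fullgame":""}',
  undefined,
];

const lines = samples.map((raw) => toMoneyline(parseJsonAttr(raw).fullgame));
console.log(lines); // [ -200, 170, null, null ]
```

Inside the each() callback, the same helpers could wrap the attribute lookup, so that a div without the attribute yields null instead of throwing from JSON.parse.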

You have to call get() to turn the cheerio object into an array:

let teamData = $('.op-matchup-wrapper').map((i, div) => ({
  time: $(div).find('.op-matchup-time').text(),
  teams: $(div).find('.op-matchup-team-text').map((i, t) => $(t).text()).get()
})).get()

Those betting lines are outside of the teams area, so you would need to get them separately and merge them somehow.
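
One possible way to do that merge is to pair entries by position. The sketch below uses plain arrays in place of the results of separate cheerio selections, and it assumes that every selection returns one entry per fighter in the same order, which should be checked against the actual page. The function name mergeByIndex and the sample values (taken from the desired output in the question) are illustrative only.

```javascript
// Hypothetical helper: combine a list of fighter names with per-sportsbook
// columns of lines, pairing entries by position.
function mergeByIndex(names, columns) {
  return names.map((name, i) => {
    const fighter = { name };
    for (const [book, values] of Object.entries(columns)) {
      // A column that is shorter than the name list yields null.
      fighter[book] = i < values.length ? values[i] : null;
    }
    return fighter;
  });
}

// Sample data. In practice each array would come from its own selection,
// converted to a plain array with .get().
const names = ["Khabib Nurmagomedov", "Dustin Poirier"];
const columns = {
  openingBetLine: ["-333", "+225"],
  bovadaBetLine: ["-365", "+275"],
};

const fighters = mergeByIndex(names, columns);
console.log(fighters);
```

Keeping the columns in an object keyed by the output property name means another sportsbook can be added with a single extra entry, without touching the merge logic.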
