
JavaScript Node.js WebScraping: How do I find specific elements on webpage table to scrape and push into an array of objects?

I am practicing web scraping on a betting site for UFC fights, using JavaScript with the request-promise and cheerio packages.

Site: https://www.oddsshark.com/ufc/odds

I want to scrape the names of the fighters and their respective betting lines from each betting company.

[Image: sample screenshot of the site]

My goal is to end up with something like an array of objects that I can later use to seed a PostgreSQL database.

Example of my desired output (doesn't have to be exactly like that but similar):

[
  { fighter 1: 'Khabib Nurmagomedov', openingBetLine: -333, bovadaBetLine: -365, etc. },
  { fighter 2: 'Dustin Poirier', openingBetLine: 225, bovadaBetLine: 275, etc. },
  { fighter 3: etc.},
  { fighter 4: etc.}
]

Below is the code I have so far. I am a noob at this:

const rp = require("request-promise");
const url = "https://www.oddsshark.com/ufc/odds";


// cheerio to parse HTML
const $ = require("cheerio");

rp(url)
  .then(function(html) {
    // it worked :)

    // console.log("MMA page:", html);
    // console.log($("big > a", html).length);
    // console.log($("big > a", html));

    console.log($(".op-matchup-team-text", html).length);
    console.log($(".op-matchup-team-text", html));
  })
  // why isn't catch working?
  .catch(function(error) {
    // handle error
  });

My code above returns indexes as keys with nested objects as values. Below is just one of them as an example.

{ '0':
   { type: 'tag',
     name: 'span',
     namespace: 'http://www.w3.org/1999/xhtml',
     attribs: [Object: null prototype] { class: 'op-matchup-team-text' },
     'x-attribsNamespace': [Object: null prototype] { class: undefined },
     'x-attribsPrefix': [Object: null prototype] { class: undefined },
     children: [ [Object] ],
     parent:
      { type: 'tag',
        name: 'div',
        namespace: 'http://www.w3.org/1999/xhtml',
        attribs: [Object],
        'x-attribsNamespace': [Object],
        'x-attribsPrefix': [Object],
        children: [Array],
        parent: [Object],
        prev: [Object],
        next: [Object] },
     prev: null,
     next: null },

I don't know what to do from here. Am I calling the right class (op-matchup-team-text)? If so, how do I extract the fighter names and betting line tag elements from the website?

////////////////////////////////////////////////////////////////////////// UPDATE 1 ON ORIGINAL POST //////////////////////////

Updated: Using Henk's suggestion, I'm able to scrape the fighter names. Using the same code template, I was able to scrape the fighters' betting lines as well.

BUT I don't know how to get both on one object. For example, how do I associate the betting line with the fighter him/herself?

Below is my code for scraping the OPENING company's betting line:

rp(url)
  .then(function(html) {
    const $ = cheerio.load(html);

    const openingBettingLine = [];

    // parent class of fighter name
    $("div.op-item.op-spread.op-opening").each((index, currentDiv) => {
      const openingBet = {
        opening: JSON.parse(currentDiv.attribs["data-op-moneyline"]).fullgame
      };
      openingBettingLine.push(openingBet);
    });
    console.log("openingBettingLine array test 2:", openingBettingLine);
  })
  // why isn't catch working?
  // eslint-disable-next-line handle-callback-err
  .catch(function(error) {
    // handle error
  });

It console logs out the following:

openingBettingLine array test 2: [ { opening: '-200' },
  { opening: '+170' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '' },
  { opening: '+105' },
  { opening: '-135' },
  { opening: '-165' },
  { opening: '+135' },
  { opening: '-120' },
  { opening: '-110' },
  { opening: '-135' },
  { opening: '+105' },
  { opening: '-165' },
  { opening: '+135' },
  { opening: '-115' },
  { opening: '-115' },
  { opening: '-145' },
  { opening: '+115' },
  { opening: '+208' },
  { opening: '-263' },
  etc. 

My desired object output is still (as example below). So how would I get the openingBettingLine into the object associated with the fighter?

[
  { fighter 1: 'Khabib Nurmagomedov', openingBetLine: -333, bovadaBetLine: -365, etc. },
  { fighter 2: 'Dustin Poirier', openingBetLine: 225, bovadaBetLine: 275, etc. },
  { fighter 3: etc.},
  { fighter 4: etc.}
]

////////////////////////////////////////////////////////////////////////// UPDATE 2 ON ORIGINAL POST //////////////////////////

I can't get the BOVADA company's betting line to scrape. I isolated the code to just this company below.

// BOVADA betting line array --> not working

rp(url)
  .then(function(html) {
    const $ = cheerio.load(html);

    const bovadaBettingLine = [];

    // parent class of fighter name
    $("div.op-item.op-spread.border-bottom.op-bovada.lv").each(
      (index, currentDiv) => {
        const bovadaBet = {
          BOVADA: JSON.parse(currentDiv.attribs["data-op-moneyline"]).fullgame
        };
        bovadaBettingLine.push(bovadaBet);
      }
    );
    console.log("bovadaBettingLine:", bovadaBettingLine);
  })
  // why isn't catch working?
  // eslint-disable-next-line handle-callback-err
  .catch(function(error) {
    // handle error
  });

It returns: bovadaBettingLine: [] with nothing in it.

Below is the HTML code for that part of the website.

[Image: enter image description here]

Short:

  1. Select the right data with a suitable cheerio method
  2. Create your own object, and put your data in there

In Detail:

First, analyse the source code around the data you want:

<div class="op-matchup-team op-matchup-text op-team-top" data-op-name="{full_name:Jessica Andrade,short_name:}"><span class="op-matchup-team-text">Jessica Andrade</span></div>

You are trying to get the name of the fighter, so you could aim either for the content of <span class="op-matchup-team-text">Jessica Andrade</span> or for the data-op-name attribute of the parent div, data-op-name="{full_name:Jessica Andrade,short_name:}".
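Either route yields the same name; the attribute route just needs a decoding step first. Below is a minimal sketch, assuming the attribute really holds JSON with quoted keys and values (the quotes seem to be missing from the snippet above, presumably lost when the markup was copied). `fighterName` is a hypothetical helper name, not part of cheerio; it falls back to the span text when the attribute cannot be parsed.

```javascript
// Hypothetical helper: prefer the JSON stored in data-op-name, fall back to the span text.
// Assumption: the attribute looks like {"full_name":"Jessica Andrade","short_name":""}.
function fighterName(dataOpName, spanText) {
  try {
    const parsed = JSON.parse(dataOpName);
    if (parsed && typeof parsed.full_name === "string" && parsed.full_name !== "") {
      return parsed.full_name;
    }
  } catch (error) {
    // Missing attribute or invalid JSON: fall through to the span text.
  }
  return (spanText || "").trim();
}

const fromAttribute = fighterName('{"full_name":"Jessica Andrade","short_name":""}', "");
const fromSpan = fighterName(undefined, "  Jessica Andrade ");
console.log(fromAttribute, "|", fromSpan);
```

Inside an each() callback this could be called as fighterName(currentDiv.attribs["data-op-name"], $(currentDiv).text()).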

Let's try the second one:

  1. Get all the divs with the desired content: $("div.op-matchup-team.op-matchup-text.op-team-top")
  2. Traverse the divs with cheerio's built-in each() iterator
  3. Within each iteration, create an object with all relevant fighter parameters and push it into a fighters array.

See also the code comments below:

const rp = require("request-promise");
const cheerio = require("cheerio");

const url = "https://www.oddsshark.com/ufc/odds";

rp(url)
  .then(function (html) {
    const $ = cheerio.load(html);

    // There is no direct selector for the rows of the second column based on the first one,
    // so select all rows of the second column once and pick the right one by index below.
    const openingDivs = $("div.op-item.op-spread.op-opening");

    const fighters = [];
    $("div.op-matchup-team.op-matchup-text.op-team-top").each((index, currentDiv) => {
      // Put the selected data into your "basket", the fighter object.
      const fighter = {
        name: JSON.parse(currentDiv.attribs["data-op-name"]).full_name,
        openingBetLine: JSON.parse(openingDivs[index].attribs["data-op-moneyline"]).fullgame
        // Go on the same way with the other rows that you need.
      };
      fighters.push(fighter);
    });

    console.log(fighters);
  })
  .catch(function (error) {
    // The catch does work; you just need to print the error to see it.
    console.log(error);
  });

This will give you:

[ { name: 'Jessica Andrade', openingBetLine: '-200' }, ... ]
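The values come back as strings such as '-200', while the desired output in the question uses numbers and one key per sportsbook. Here is a rough sketch of that last step on plain arrays, assuming every column was scraped in the same row order as the names. `toMoneyline` and `buildFighters` are hypothetical helper names, and the sample values are taken from the asker's example output, with one line left empty to show the missing-value case.

```javascript
// Hypothetical helper: '-200' -> -200, '+170' -> 170, '' -> null (no line posted).
function toMoneyline(text) {
  const value = Number.parseInt(text, 10);
  return Number.isNaN(value) ? null : value;
}

// Hypothetical helper: names is an array of strings; columns maps an output key
// to the raw strings scraped for that sportsbook, in the same row order as names.
function buildFighters(names, columns) {
  return names.map((name, index) => {
    const fighter = { name };
    for (const [key, rawLines] of Object.entries(columns)) {
      fighter[key] = toMoneyline(rawLines[index]);
    }
    return fighter;
  });
}

const fighters = buildFighters(
  ["Khabib Nurmagomedov", "Dustin Poirier"],
  {
    openingBetLine: ["-333", "+225"],
    bovadaBetLine: ["-365", ""]
  }
);
console.log(fighters);
```

Keeping missing lines as null rather than 0 makes it easier to tell "no line posted" apart from a real value when the rows are later inserted into a database.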

You have to call get() to turn the cheerio object into an array:

let teamData = $('.op-matchup-wrapper').map((i, div) => ({
  time: $(div).find('.op-matchup-time').text(),
  teams: $(div).find('.op-matchup-team-text').map((i, t) => $(t).text()).get()
})).get()

Those betting lines sit outside the teams area, so you would need to select them separately and then merge them with the team data, for example by index.
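One possible way to do that merge, sketched on plain data: flatten the per-matchup teams arrays into a single fighter list and pair each fighter with the line at the same position. This assumes the betting-line elements appear in the same order as the fighters, two per matchup, which would need to be checked against the page. `attachLines` is a hypothetical name, and the time and second fighter name below are placeholders.

```javascript
// Hypothetical helper: teamData has the shape produced by the map()/get() call above;
// lines is a flat array of raw moneyline strings, assumed to be in fighter order.
function attachLines(teamData, lines) {
  const fighters = [];
  for (const matchup of teamData) {
    for (const name of matchup.teams) {
      // fighters.length is this fighter's position in the flattened list.
      const raw = lines[fighters.length];
      fighters.push({
        name,
        time: matchup.time,
        opening: raw === undefined ? "" : raw
      });
    }
  }
  return fighters;
}

const merged = attachLines(
  [{ time: "10:00 PM", teams: ["Jessica Andrade", "Fighter B"] }], // placeholder data
  ["-200", "+170"]
);
console.log(merged);
```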
