JavaScript Node.js WebScraping: How do I find specific elements on webpage table to scrape and push into an array of objects?
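I'm trying to practice web scraping using a betting website for UFC fights. I'm using JavaScript and the packages request-promise and cheerio.
Website: https://www.oddsshark.com/ufc/odds
I want to scrape the fighters' names and their respective betting lines for each betting company.
My goal is to end up with something like an array of objects that I can later use to seed a PostgreSQL database.
Example of my desired output (it doesn't have to be exactly the same, but similar):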
[
{ fighter 1: 'Khabib Nurmagomedov', openingBetLine: -333, bovadaBetLine: -365, etc. },
{ fighter 2: 'Dustin Poirier', openingBetLine: 225, bovadaBetLine: 275, etc. },
{ fighter 3: etc.},
{ fighter 4: etc.}
]
Below is the code I have so far. I'm a beginner:
const rp = require("request-promise");
const url = "https://www.oddsshark.com/ufc/odds";
// cheerio to parse HTML
const $ = require("cheerio");

rp(url)
  .then(function(html) {
    // it worked :)
    // console.log("MMA page:", html);
    // console.log($("big > a", html).length);
    // console.log($("big > a", html));
    console.log($(".op-matchup-team-text", html).length);
    console.log($(".op-matchup-team-text", html));
  })
  // why isn't catch working?
  .catch(function(error) {
    // handle error
  });
My code above returns indexes as keys with nested objects as values. Below is just one of them as an example.
{ '0':
{ type: 'tag',
name: 'span',
namespace: 'http://www.w3.org/1999/xhtml',
attribs: [Object: null prototype] { class: 'op-matchup-team-text' },
'x-attribsNamespace': [Object: null prototype] { class: undefined },
'x-attribsPrefix': [Object: null prototype] { class: undefined },
children: [ [Object] ],
parent:
{ type: 'tag',
name: 'div',
namespace: 'http://www.w3.org/1999/xhtml',
attribs: [Object],
'x-attribsNamespace': [Object],
'x-attribsPrefix': [Object],
children: [Array],
parent: [Object],
prev: [Object],
next: [Object] },
prev: null,
next: null },
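I don't know what to do from here. Am I selecting the right class (op-matchup-team-text)? If so, how do I extract the fighter names and the betting-line tag elements from the website?
///////////////////// UPDATE 1 ON ORIGINAL POST /////////////////////
Update: Using Henk's suggestion, I was able to scrape the fighter names. Using the fighter-name code as a template, I was also able to scrape the fighters' betting lines.
But I don't know how to get both on one object. For example, how do I associate a betting line with the fighter it belongs to?
Below is the code I used to scrape the OPENING company's betting lines: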
rp(url)
  .then(function(html) {
    const $ = cheerio.load(html);
    const openingBettingLine = [];
    // parent class of fighter name
    $("div.op-item.op-spread.op-opening").each((index, currentDiv) => {
      const openingBet = {
        opening: JSON.parse(currentDiv.attribs["data-op-moneyline"]).fullgame
      };
      openingBettingLine.push(openingBet);
    });
    console.log("openingBettingLine array test 2:", openingBettingLine);
  })
  // why isn't catch working?
  // eslint-disable-next-line handle-callback-err
  .catch(function(error) {
    // handle error
  });
It logs the following to the console:
openingBettingLine array test 2: [ { opening: '-200' },
{ opening: '+170' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '' },
{ opening: '+105' },
{ opening: '-135' },
{ opening: '-165' },
{ opening: '+135' },
{ opening: '-120' },
{ opening: '-110' },
{ opening: '-135' },
{ opening: '+105' },
{ opening: '-165' },
{ opening: '+135' },
{ opening: '-115' },
{ opening: '-115' },
{ opening: '-145' },
{ opening: '+115' },
{ opening: '+208' },
{ opening: '-263' },
etc.
My desired object output is still as in the example below. So how do I put openingBettingLine into the object associated with the fighter?
[
{ fighter 1: 'Khabib Nurmagomedov', openingBetLine: -333, bovadaBetLine: -365, etc. },
{ fighter 2: 'Dustin Poirier', openingBettingLine: 225, bovadaBetLine: 275, etc. },
{ fighter 3: etc.},
{ fighter 4: etc.}
]
///////////////////// UPDATE 2 ON ORIGINAL POST /////////////////////
I can't get the BOVADA company's betting lines to scrape. I've isolated the code for this company below.
// BOVADA betting line array --> not working
rp(url)
  .then(function(html) {
    const $ = cheerio.load(html);
    const bovadaBettingLine = [];
    // parent class of fighter name
    $("div.op-item.op-spread.border-bottom.op-bovada.lv").each(
      (index, currentDiv) => {
        const bovadaBet = {
          BOVADA: JSON.parse(currentDiv.attribs["data-op-moneyline"]).fullgame
        };
        bovadaBettingLine.push(bovadaBet);
      }
    );
    console.log("bovadaBettingLine:", bovadaBettingLine);
  })
  // why isn't catch working?
  // eslint-disable-next-line handle-callback-err
  .catch(function(error) {
    // handle error
  });
It returns: bovadaBettingLine: []
Nothing at all.
Below is the HTML code for that part of the website.
Short:
Detailed:
First, analyze the source code of the data you want:
<div class="op-matchup-team op-matchup-text op-team-top" data-op-name="{full_name:Jessica Andrade,short_name:}"><span class="op-matchup-team-text">Jessica Andrade</span></div>
You are trying to get the fighter's name. So you can target either <span class="op-matchup-team-text">Jessica Andrade</span> or the attribute of the parent div, i.e. data-op-name="{full_name:Jessica Andrade,short_name:}"
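Let's try the second one:
Select the divs: $("div.op-matchup-team.op-matchup-text.op-team-top")
Loop over the divs with the each() iterator
Push the data from each div into the fighters array. See also the comments in the code below: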
const rp = require("request-promise");
const cheerio = require("cheerio");
const url = "https://www.oddsshark.com/ufc/odds";

rp(url)
  .then(function (html) {
    const $ = cheerio.load(html);
    const fighters = [];
    $("div.op-matchup-team.op-matchup-text.op-team-top").each((index, currentDiv) => {
      const fighter = {
        name: JSON.parse(currentDiv.attribs["data-op-name"]).full_name,
        // There is no direct selector for the rows of the second column based on the first one.
        // So you need to select all rows of the second column as you did, and then use the current index
        // to get the right row. Put the selected data into your "basket", the fighter object. Done.
        openingBetLine: JSON.parse(
          $("div.op-item.op-spread.op-opening")[index].attribs["data-op-moneyline"]
        ).fullgame
        // go on the same way with the other rows that you need.
      };
      fighters.push(fighter);
    });
    console.log(fighters);
  })
  .catch(function (error) {
    // error catch does work, you just need to print it out to see it
    console.log(error);
  });
This will give you:
[{ name: 'Jessica Andrade',
openingBetLine: '-200'},...]
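When going on to the other sportsbook columns, a class name that itself contains a dot needs care. The Bovada selector in the question, div.op-item.op-spread.border-bottom.op-bovada.lv, asks for two separate classes, op-bovada and lv. If the element's class is literally op-bovada.lv (an assumption, since the HTML for that column is not shown here), the dot has to be escaped with a backslash, and that would be one possible explanation for the empty array. A minimal sketch, using a hypothetical helper named classSelector that builds such a selector:

```javascript
// Hypothetical helper: build a selector like "div.a.b" from a tag name and a
// list of class names, backslash-escaping characters such as "." that would
// otherwise be read as selector syntax.
// Simplification: class names that start with a digit are not handled.
function classSelector(tag, classNames) {
  const escaped = classNames.map((name) =>
    name.replace(/[^a-zA-Z0-9_-]/g, "\\$&")
  );
  return [tag, ...escaped].join(".");
}

// Assumed class list; check it against the real page source.
const bovadaSelector = classSelector("div", [
  "op-item",
  "op-spread",
  "border-bottom",
  "op-bovada.lv"
]);

console.log(bovadaSelector);
```

The resulting string can be passed to $() in place of the hand-written selector. An attribute selector such as [class~="op-bovada.lv"] should be another way to match a class that contains a dot.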
You have to call get() to convert the cheerio object into an array:
let teamData = $('.op-matchup-wrapper').map((i, div) => ({
time: $(div).find('.op-matchup-time').text(),
teams: $(div).find('.op-matchup-team-text').map((i, t) => $(t).text()).get()
})).get()
The betting lines are outside the team area, so you'll need to get them separately and merge them somehow.
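One way to do that merge, sketched under the assumption that the betting-line elements appear in the same page order as the fighter names (two per matchup), is to flatten the team names and pair them with the lines by index. The helper names attachLines and parseMoneyline are hypothetical, and the sample data is made up for illustration; only its shape follows the snippet above.

```javascript
// Hypothetical helper: convert a moneyline string such as "-200" or "+170"
// into a number. Empty or non-numeric strings (no line posted) become null.
function parseMoneyline(raw) {
  const text = String(raw).trim();
  if (text === "") return null;
  const value = Number(text);
  return Number.isFinite(value) ? value : null;
}

// Hypothetical helper: pair each fighter with the line at the same index.
// Assumes `lines` is in the same order as the flattened fighter names.
function attachLines(teamData, lines) {
  const names = teamData.flatMap((matchup) => matchup.teams);
  return names.map((name, i) => ({
    name,
    openingBetLine: parseMoneyline(lines[i] ?? "")
  }));
}

// Made-up sample input in the shape produced by the snippet above.
const sampleTeams = [
  { time: "", teams: ["Fighter A", "Fighter B"] },
  { time: "", teams: ["Fighter C", "Fighter D"] }
];
const sampleLines = ["-200", "+170", "", "+105"];

console.log(attachLines(sampleTeams, sampleLines));
```

Storing null for a missing line keeps the column numeric, which tends to be easier to seed into a database than an empty string. If the two lists can differ in length on the real page, it is worth comparing their lengths before merging.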