In R, use rvest and xml2 to extract JSON object from a <script> element on website

I previously posted a related Stack Overflow question about scraping a table on the leaderboard page of the PGA's website. To summarize that post, the leaderboard table is difficult to scrape, apparently because the page uses JavaScript to render the table.

Inspecting the page, I can see in a <script> tag that there is an object global.leaderboardConfig with useful info.

Is it possible to get this object as a list in R? I am able to grab all 76 script elements on the page using xml2::read_html('https://www.pgatour.com/leaderboard.html') %>% html_nodes('script'). However, I'm not sure how to identify the specific script tag needed, nor how to get an object out of it.

Edit: In the Network tab of devtools, there is also a request that provides the link for an API call that gets the data. Rather than fetching the object from the script tag, perhaps it is easier to grab all network requests and sift through those instead?

This site generates the values of the hmac and expiry URL parameters with a JS function that uses a specific algorithm. The inputs of this algorithm depend on the epoch time, which is passed as a URL parameter to the JS file hosting that function. As a result, the hmac value is different each time, because it is computed from a file whose URL changes constantly.

This algorithm consists of a chain of bitwise AND and XOR steps, like this (pseudocode):

step = ((value * value - fixedValue) & bitMask) ^ xorKey1
result += fromCharCode(step)
value = step

step = ((value * value - fixedValue) & bitMask) ^ xorKey2
result += fromCharCode(step)
value = step

....
....

The xorKey numbers are generated dynamically on https://microservice.pgatour.com/js based on the epoch time. You just need to request this JS file with the current epoch time as a URL parameter and use a regex to extract all the xorKey values required by the above algorithm (they start with -1). You will also need to reproduce the algorithm above in R.
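As a minimal sketch of the pseudocode above, written in JavaScript because that is what the site's own code uses, the chain can be expressed as a pure function. The name runXorChain and its parameter names are made up for illustration. The seed 101, the bit mask 4294967295, the hashed id 1798339286 and the two sample keys are the values that appear in the scripts and the deobfuscated sample in this answer.

```javascript
// Sketch of the chained AND/XOR steps; function and parameter names are illustrative only.
// In JavaScript, `x & 4294967295` wraps x to a signed 32-bit integer.
const BIT_MASK = 4294967295;

function runXorChain(xorKeys, encodedId, seed = 101) {
  let value = seed;
  let result = String.fromCharCode(value); // the first character comes from the seed
  for (const key of xorKeys) {
    const step = ((value * value - encodedId) & BIT_MASK) ^ key;
    result += String.fromCharCode(step); // each key yields one more character
    value = step; // the next step depends on the previous one
  }
  return result;
}

// The first two keys of the sample file decode to the start of "exp=..."
console.log(runXorChain([-1798328965, -1798324966], 1798339286)); // prints exp
```

Because each step feeds into the next one, the keys have to be applied in the order in which they appear in the file.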

The following script generates the url parameters and makes the API call:

library(httr)
library(stringr)
library(bitops)

# fixed values
init <- 4294967295
value <- 101
encodedId <- 1798339286
result <- rawToChar(as.raw(value))

# epoch time is dynamic
time <- as.numeric(as.POSIXct(Sys.time()))*1000

output <- content(GET("https://microservice.pgatour.com/js", query = list(
    "_" = format(time, digits=13)
  )), as = "text", encoding = "UTF-8")

steps <- regmatches(output, gregexpr("-1[0-9]+", output, perl=TRUE)) #extract steps
stepsNum <- as.numeric(unlist(steps)) #convert to num

# chain the AND/XOR steps, one output character per extracted key
for (t in stepsNum) {
    step <- bitXor(bitAnd(value * value - encodedId, init), t)
    result <- paste0(result, rawToChar(as.raw(step)))
    value <- step
}
print(result)

# extract leaderboard config url
output <- content(GET("https://www.pgatour.com/leaderboard.html"), as = "text", encoding = "UTF-8")
# the url is embedded as leaderboardUrl: '...' with escaped slashes (\/)
configUrl <- gsub("\\\\/", "/", str_match(output, "leaderboardUrl:\\s*'(.*?)'")[2])

# append the generated token as the userTrackingId parameter
url <- paste0(configUrl, "?userTrackingId=", result)
data <- content(GET(url), as = "parsed", type = "application/json")

print(data)

kaggle link: https://www.kaggle.com/bertrandmartel/pgatourextract

How to find this algorithm?

I searched through the JavaScript code and reversed the obfuscated parts into something understandable. It is quite a long road, so let's go through it step by step.

Mission n°1 - search for leaderboardUrl

You've given the first hint in your question: the location of the config, which contains a leaderboardUrl.

There is a JS file named stroke-play-leaderboard-controller-56223356ffc8423f5d6e.js that has occurrences of leaderboardUrl in config.leaderboardUrl:

{
    key: "getLeaderboardData",
    value: function (t, r, n) {
      var o = this,
        e = (0, h.resolveUrl)(this.config.leaderboardUrl, r()), // <=== HERE
        a = [this.performFetch(e)].concat(
          g(
            "initial" === n && this.config.translationsUrl
              ? [y.default.load(this.config.translationsUrl)]
              : []
          )
        ),
        ..........
}

Let's look at the performFetch function, which seems to send the request:

{
    key: "performFetch",
    value: function (t) {
      var r = this,
        e =
          1 < arguments.length && void 0 !== arguments[1]
            ? arguments[1]
            : {};
      return t
        ? ((0, a.isProtectedUrl)(t) &&
            (t = this.getUrlWithAuth(t)), // <=== HERE
          (0, o.default)(t, e)
            .then(function (e) {
              return r.checkFetchResponseStatus(e, t);
    .................

We've spotted the getUrlWithAuth function:

  {
    key: "getUrlWithAuth",
    value: function (e) {
      var t = u.setTrackingUserId, 
        r = u.UserIdTracker, 
        n = r && r.getTrackingUserIdParam && r.getUserId; // <=== HERE
      if (t && n) {
        var o = r.getTrackingUserIdParam(), // <=== HERE
          a = t(r.getUserId());
        return u.setUrlParameter(e, o, a);
      }
      return e;
    },
  },

Now we have getUserId and getTrackingUserIdParam, which look like the functions that supply the authorization parameter added to the URL. The problem is that we still have to find where they are defined.
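The body of u.setUrlParameter is not shown here, so the following is only a rough sketch of the effect getUrlWithAuth appears to have: a name/value pair is appended to the request URL. The helper name appendTrackingParam and the example.com URL are hypothetical. The parameter name userTrackingId is the one used in the R script above, and like that script the sketch does not percent-encode the value, which is an assumption about what the server expects.

```javascript
// Hypothetical stand-in for the effect of getUrlWithAuth: append one query parameter.
// The value is appended as-is (no percent-encoding), mirroring the R script's paste0 call.
function appendTrackingParam(url, paramName, token) {
  const separator = url.includes("?") ? "&" : "?";
  return url + separator + paramName + "=" + token;
}

// Placeholder URL and token, for illustration only.
console.log(
  appendTrackingParam(
    "https://example.com/leaderboard.json",
    "userTrackingId",
    "exp=1~acl=*~hmac=ab"
  )
);
```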

Mission n°2 - Deobfuscation challenge: substitutions

I've spotted a file named main.c03ddfd249437fcce43410c35a21c6f8.js that contains occurrences of getUserId and getTrackingUserIdParam:

var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];
var A = function(g, e) {
    return t[g -= 398]
},
(function(g, e) {
    for (var t = A; ; )
        try {
            if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
                break;
            g.push(g.shift())
        } catch (e) {
            g.push(g.shift())
        }
}
)(t)
.................
function(g, e) {
    var t = A
      , C = e[t(439) + t(403) + "r"] = e["pga" + t(403) + "r"] || {}
      , I = t(428) + t(423) + t(407)
      , o = t(483) + "rTr" + t(446) + t(477) + "Id";
    C[t(489) + t(463) + t(469) + "cker"] = {
        ........................
        getTrackingUserIdParam: function() {
            return o
        },
        getUserId: function() {
            return I
        },
        ......................
    }
}(jQuery, window)
},

I've skipped a lot of code in the above snippet to make it clearer.

You can see that the code relies on substitutions. The t array is the base, and the A function looks up a string in it by subtracting an offset (398) from its argument. There is also an init function that rotates the initial t array until a checksum matches, so that the lookups decode to the right strings.

You can paste this snippet into a Node.js script, modify it a little, and then use something like:

var t = ["PCl", "tUp", "set", "270981fHMpFv", "13687NsSiEo", "rId", "cri", "onR", "DEV", "oTr", "Int", "Tra", "_PR", "val", "1ceOiGP", "sts", "oad", "Fin", "_UA", "ing", "IdP", "TA_", "Scr", "erI", "hTo", "use", "erv", "tim", "tus", "205913gYUZtZ", "ara", "Use", "sta", "STA", "LBD", "pat", "253565HlVREe", "rva", "Ref", "now", "ien", "ref", "89874aJWuvR", "scr", "HTT", "arI", "equ", "efo", "Eve", "ngU", "1eAGUfS", "Url", "bef", "onl", "res", "p:/", "add", "nte", "rlT", "IdS", "Loa", "157231BKOPfc", "1YJedti", "hIn", "ate", "ser", "TDA", "bin", "upd", "xEr", "tou", "dHo", "ps:", "din", "931", "isU", "aja", "tup", "ste", "ntL", "Pat", "_DE", "ays", "onO", "edA", "Sen", "261593SNtpWc", "ore", "gth", "las", "730", "ame", "ter", "ime", "UAT", "id8", "ues", "est", "rtT", "xSe", "ist", "ptL", "ATA", "len", "ipt", "get", "pga", "Tru", "rep", "ish", "url", "alw", "dat", "ack", "lac", "onB", "uld", "cki", "ken", "ind", "onS", "sho", "htt", "ror", "API"];

var A = function(g, e) {
    return t[g -= 398]
};
console.log(t);
(function(g, e) {
    for (var t = A; ; )
        try {
            if (189298 === parseInt(t(494)) + -parseInt(t(508)) * parseInt(t(461)) + -parseInt(t(519)) + parseInt(t(419)) + -parseInt(t(520)) * -parseInt(t(487)) + parseInt(t(472)) * -parseInt(t(462)) + -parseInt(t(500)))
                break;
            g.push(g.shift())
        } catch (e) {
            g.push(g.shift())
        }
}
)(t)
console.log(t);
console.log(`e[${A(439) + A(403) + "r"}] = e[${"pga" + A(403) + "r"}] || {};`);

// prints e[pgatour] = e[pgatour] || {};

Here e is window so you "just" have to substitute all the A(XXX) in order to understand better what is going on.
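Substituting every A(XXX) by hand is tedious, so here is a small sketch of how it could be automated once the table has been rotated. The helper name inlineLookups and the five-entry toy table are made up for illustration (the real table is the rotated t array). The string folding is deliberately naive and only handles double-quoted literals joined by +.

```javascript
// Hypothetical helper: replace each `A(<number>)` call with the string it decodes to,
// then fold adjacent "x" + "y" literals into a single literal.
function inlineLookups(source, table, offset, name = "A") {
  const call = new RegExp("\\b" + name + "\\((\\d+)\\)", "g");
  let current = source.replace(call, (match, index) => {
    const value = table[Number(index) - offset];
    // Leave the call untouched if the index falls outside the table.
    return value === undefined ? match : JSON.stringify(value);
  });
  // Repeat the folding pass until the text stops changing.
  const concat = /"((?:[^"\\]|\\.)*)"\s*\+\s*"((?:[^"\\]|\\.)*)"/g;
  let previous;
  do {
    previous = current;
    current = current.replace(concat, '"$1$2"');
  } while (current !== previous);
  return current;
}

// Toy table standing in for the rotated array; the offset 398 matches the A function.
const toyTable = ["pga", "tou", "get", "Use", "rId"];
console.log(inlineLookups('e[A(398) + A(399) + "r"]', toyTable, 398));
console.log(inlineLookups('this[A(400) + A(401) + A(402)]()', toyTable, 398));
```

With the real rotated array passed as table, the same call can be run over a whole function body.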

You would spot this:

onBeforeSendRequest: function(g, e) {
    var A = t;
    if (this[A(408) + A(516) + "oTr" + A(446)](e.url) && window[A(439) + A(403) + "r"][A(460) + A(469) + A(450) + A(507) + A(398) + "Id"]) {
        var I = this["getUse" + A(463)]()
            , o = window[A(439) + A(403) + "r"]["set" + A(469) + A(450) + A(507) + A(398) + "Id"](I)
            , n = this[A(438) + A(469) + A(450) + A(507) + "ser" + A(478) + A(488) + "m"]();
        e.url = C[A(460) + A(509) + "Par" + A(424) + A(425)](e[A(443)], n, o)
    }
},

which when decoded gives something like:

onBeforeSendRequest: function(g, e) {
    if (this["isUrlToTrack"](e.url) && window["pgatour"]["setTrackingUserId"]) {
        var I = this["getUserId"]()
            , o = window["pgatour"]["setTrackingUserId"](I)
            , n = this["getTrackingUserIdParam"]();
        e.url = C["setUrlParameter"](e["url"], n, o)
    }
},

The function we are looking for is window["pgatour"]["setTrackingUserId"]. We could have known this since mission n°1, though. Remember this line in the first JS file:

var t = u.setTrackingUserId

where u is window.pgatour.

But here, the input parameter I is hard-coded:

var I = A(428) + A(423) + A(407);

which is equivalent to var I = "id8730931"

Now let's look at the window["pgatour"]["setTrackingUserId"] function.

Mission n°3 - Crypto/reverse

Open the Chrome developer console on the website and paste window["pgatour"]["setTrackingUserId"]. You will get something like this:

function(_$_$){var $$ = _$__(_$_$); var _$_, ___, __;.................

Yes :( more obfuscated code to deal with.

By looking at the application scripts, you can find the file where this function is defined. This is the JS file URL:

https://microservice.pgatour.com/js?_=1618868625306

There is a URL parameter specifying an epoch time (in milliseconds), and the code changes depending on this parameter.
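As a small sketch, the URL of that file for a given moment can be built like this. The function name buildScriptUrl is made up for illustration. The host, path and the _ parameter are taken from the URL above.

```javascript
// Sketch: build the URL of the dynamic JS file for a given epoch time in milliseconds.
function buildScriptUrl(epochMillis = Date.now()) {
  const url = new URL("https://microservice.pgatour.com/js");
  url.searchParams.set("_", String(epochMillis));
  return url.toString();
}

console.log(buildScriptUrl(1618868625306));
// prints https://microservice.pgatour.com/js?_=1618868625306
```

Calling buildScriptUrl() without an argument uses the current time, which is what the R script does with Sys.time().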

Looking at the code itself, we get something like this after substituting the input parameters, which are String.fromCharCode and Math.abs:

((function($__$, _, $_$) { 
    var $$_ = 4294967295; // doesn't change when the epoch time is updated
    function _$__($) {
        var $$__ = 42;
        for (var _ = 0; _ < $.length; _++) {
            $$__ = ($$__ * 31 + $.charCodeAt(_)) & $$_;
        }
        return Math.abs($$__);
    }
    ......
    _$_ = (__ * __ - $$) & $$_ ^ -30086, // doesn't change when the epoch time is updated
    ___ += _(_$_),
    __ = _$_,
    _$_ = (__ * __ - $$) & $$_ ^ -33221,
    ___ += _(_$_),
    .....
    $__$[__$_] = (function(_$_$ = "id8730931") { // this is the window["pgatour"]["setTrackingUserId"] function / input is id8730931
        var $$ = _$__(_$_$);
        var _$_, ___, __;
        var __ = (__ = 101,
        ___ = String.fromCharCode(__),
        _$_ = (__ * __ - $$) & $$_ ^ -1798328965, // this changes when the epoch time is updated
        ___ += String.fromCharCode(_$_),
        __ = _$_,
        _$_ = (__ * __ - $$) & $$_ ^ -1798324966,
        ___ += String.fromCharCode(_$_),
        __ = _$_,
        ....
        __ = _$_,
        ___);
        return __
    }
    );
}
)((window.pgatour || (window.pgatour = {})), String.fromCharCode, Math.abs));

We can make a script to reproduce this algorithm in a simpler way by extracting the step value (in the xor stage):

const axios = require("axios");

// fixed values (they don't change with the epoch time)
const init = 4294967295;
const encodedId = 1798339286;
let value = 101;
let result = String.fromCharCode(value);

(async function () {
  // the xor keys embedded in this file depend on the epoch time in the url
  const response = await axios.get(
    "https://microservice.pgatour.com/js?_=1618868625506"
  );
  // extract the xor keys (for this epoch time they all start with -17)
  const data = response.data.match(/-17\d+/g).map((it) => parseInt(it, 10));

  for (const t of data) {
    const step = ((value * value - encodedId) & init) ^ t;
    result += String.fromCharCode(step);
    value = step;
  }
  console.log(result);
})();

Output:

exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd

If you change the epoch time, you get a different result.
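To make the structure of that output easier to see, here is a small sketch that splits it into its three fields. The function name parseToken is made up for illustration. Reading exp as a Unix timestamp in seconds is an assumption based on its magnitude, not something confirmed by the site.

```javascript
// Sketch: split a token of the form exp=...~acl=...~hmac=... into its fields.
function parseToken(token) {
  const fields = {};
  for (const part of token.split("~")) {
    const eq = part.indexOf("=");
    if (eq === -1) continue; // skip anything that is not a key=value pair
    fields[part.slice(0, eq)] = part.slice(eq + 1);
  }
  return fields;
}

const sample =
  "exp=1618882930~acl=*~hmac=0274aecb617168167713a757e301c33e9708da3ab643663f97a4775040bf3bdd";
const fields = parseToken(sample);
console.log(fields);

// Assumption: exp is an expiry time expressed in Unix seconds.
console.log(new Date(Number(fields.exp) * 1000).toISOString());
```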

repl.it: https://replit.com/@bertrandmartel/PegatourEncrypt

Then you just need to convert this script to R and make your HTTP call with the URL parameters.

Note that encodedId comes from the input id id8730931 converted using this function (those values don't seem to change with the epoch time):

var $$_ = 4294967295;
function _$__($) {
    var $$__ = 42;
    for (var _ = 0; _ < $.length; _++) {
        $$__ = ($$__ * 31 + $.charCodeAt(_)) & $$_;
    }
    return Math.abs($$__);
}
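As a quick check, the same function with readable names (the name hashTrackingId is made up for illustration) reproduces the encodedId constant used in the scripts above when it is given the hard-coded id:

```javascript
// Readable version of the obfuscated hash function above.
// `& 4294967295` wraps the running value to a signed 32-bit integer on every step.
const MASK = 4294967295;

function hashTrackingId(id) {
  let hash = 42;
  for (let i = 0; i < id.length; i++) {
    hash = (hash * 31 + id.charCodeAt(i)) & MASK;
  }
  return Math.abs(hash);
}

console.log(hashTrackingId("id8730931")); // prints 1798339286
```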

My guess is that the server checks that the hmac correctly refers to the initial id string id8730931, so it's safe to hardcode it (since it's also hardcoded on the server).
