Is there a way to programmatically run a JavaScript function from the source of a website?

On basketball-reference.com, there is an injury page that shows all of the current injuries in the NBA. I'd like to begin archiving this data to keep a daily record of who's injured in the NBA. Apart from simply being a basketball stat nut, I will use this as an input to a Bayesian model that predicts a player's playing time from his teammates' injuries.

Now, I could simply go to this page once a day, click the "Get Table as a CSV" button, and copy and paste that into a file, but this seems like a job for cron.

I could grab the raw HTML and parse it, but the web page already has a get_csv_output(e) function readily available in its sr-min.js file. In fact, if I open up the developer console and type in

get_csv_output("injuries")

I get all of the csv dumped out as a string. It feels an awful lot like reinventing the wheel when I could simply use this function.

Somehow there is a disconnect in my mind though. I don't grok how I can visit a page, run a js function, and save the output without spinning up a full chrome driver instance through selenium or something. This feels like a simple problem with a simple solution that I just don't know.

I don't particularly care what language the solution is in, although I'd prefer Python, bash, or some other lightweight solution.

Please let me know if I'm being naive.

Edit: The page is https://www.basketball-reference.com/friv/injuries.cgi

Edit 2: The accepted answer is an excellent solution for future reference.

I ended up doing

curl https://www.basketball-reference.com/friv/injuries.cgi | python3 convert_injury_html_to_csv.py > "$(date +'%Y%m%d')".tsv

Where the python script is...

import sys
from bs4 import BeautifulSoup


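def parse_injury_html(html_doc):
    """Yield (name, team, update, description) for each injury row."""
    soup = BeautifulSoup(html_doc, "html.parser")
    injuries_table = soup.find(id="injuries")
    for row in injuries_table.tbody.find_all("tr"):
        # BeautifulSoup returns "class" as a list of names, so test
        # membership instead of comparing against a string.
        if "thead" in row.get("class", []):
            continue
        name = row.th
        team, update, description = row.find_all("td")
        # get_text() still works when a cell holds nested tags,
        # where .string would be None.
        cells = (name, team, update, description)
        yield tuple(cell.get_text(strip=True) for cell in cells)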


def main():
    for (name, team, update, description) in parse_injury_html(sys.stdin.read()):
        print(f"{name}\t{team}\t{update}\t{description}")


if __name__ == '__main__':
    main()

You could more directly just run the code in that JS function. Node.js is a standalone JS engine, so you may be able to use it to run the exact same function.

That function is most likely just making HTTP requests to download the data from a server, perhaps with some mild data manipulation. The networking layers in Node and in browser JS are not the same, but there are polyfills available. If the JS function uses the fetch API, you can use node-fetch; if it uses XHR-style requests, xmlhttprequest.

Since the code is probably a simple data fetch, it might be easy enough to reverse-engineer what's going on and write your own script, in whatever language you prefer, that makes the same type of HTTP request. Watching the network tab of your developer tools should tell you where it's getting its data.

Just executing this function won't do any good, because it must be executed in the context of that injuries page. If you look at its code, it effectively parses the HTML that is already on the page. It's a weird way of doing things, but I've seen worse. Never mind.
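Because the function only converts a table that is already in the page's HTML, the same conversion can be sketched outside the browser with nothing but the Python standard library. This is not the site's get_csv_output code, just a rough equivalent: the names TableToRows and table_to_csv are made up for illustration, the table id "injuries" comes from the question, and skipping rows with the "thead" class is an assumption borrowed from the script in the question.

```python
import csv
import io
from html.parser import HTMLParser


class TableToRows(HTMLParser):
    """Collect the cell text of every row in the table with a given id."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.in_cell = False
        self.skip_row = False
        self.row = None
        self.rows = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "tr":
            # Assumption: repeated header rows inside the body carry class "thead".
            self.skip_row = "thead" in (attrs.get("class") or "").split()
            self.row = []
        elif self.in_table and self.row is not None and tag in ("th", "td"):
            self.in_cell = True
            self.row.append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False  # nested tables are not handled
        elif tag in ("th", "td"):
            self.in_cell = False
        elif tag == "tr" and self.row is not None:
            if self.row and not self.skip_row:
                self.rows.append([cell.strip() for cell in self.row])
            self.row = None
            self.in_cell = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. links) is appended to the current cell.
        if self.in_cell and self.row is not None:
            self.row[-1] += data


def table_to_csv(html_doc, table_id="injuries"):
    """Return the table's rows, including its first header row, as CSV text."""
    parser = TableToRows(table_id)
    parser.feed(html_doc)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()
```

Piping the curl output from the question into a script that prints table_to_csv(sys.stdin.read()) would then give a daily CSV. The csv module quotes fields containing commas, which matters here because the dates and descriptions contain commas.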

The easiest solution is to use something that opens the page and calls the function just as you do in devtools. Barmar suggested Selenium, but I personally prefer puppeteer. It runs on Node.js, opens Chrome in headless (windowless) mode, and can call any function a page exposes globally; in our case, the get_csv_output function.

After that you may do whatever you want with the resulting string: dump it to a DB or save it to a file.

An example of puppeteer code.
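Since the question prefers Python, here is a rough sketch of the same idea using Selenium's Python bindings rather than puppeteer. It assumes Selenium 4 and a local Chrome install, and that get_csv_output is defined globally by the time the page has loaded (which the devtools experiment in the question suggests, but I have not verified it in headless mode). The helper names get_table_csv, make_headless_chrome and main are made up for illustration.

```python
URL = "https://www.basketball-reference.com/friv/injuries.cgi"


def get_table_csv(driver, url, table_id="injuries"):
    """Load the page and return whatever its get_csv_output function returns."""
    driver.get(url)
    # arguments[0] is how Selenium passes Python values into the page script.
    return driver.execute_script("return get_csv_output(arguments[0]);", table_id)


def make_headless_chrome():
    """Start Chrome without a window (assumes Selenium 4 and Chrome are installed)."""
    # Imported lazily so get_table_csv can be used and tested without Selenium.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    # Assumption: a recent Chrome; older versions use plain "--headless".
    options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)


def main():
    driver = make_headless_chrome()
    try:
        print(get_table_csv(driver, URL))
    finally:
        driver.quit()  # always shut the browser down, even on errors
```

Calling main() from a script that cron runs once a day and redirecting its output to a dated file gives the archive the question asks for. It does still spin up a full browser, which is the price of running the page's own function instead of parsing the HTML yourself.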
