Web-scraping a website that is loaded with JavaScript (using JavaScript)

I am trying to gather line-ups from football/soccer match reports. I decided to scrape the data from a reports provider, but their pages are populated by JavaScript.

To be more specific, take this link to a flashscores.co.uk match.

First, they restrict CORS, so I used allorigins.me to get around it, with this code:

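function readurl(url, elementID){
    // Route the request through the allorigins.me proxy to get around CORS.
    var proxyUrl = "http://allorigins.me/get?url=" + encodeURIComponent(url) + "&callback=?";
    var xhttp = new XMLHttpRequest();
    xhttp.onreadystatechange = function() {
        // readyState 4 means the request is done; status 200 means it succeeded.
        if (this.readyState === 4 && this.status === 200) {
            // Insert the raw response text into the target element.
            document.getElementById(elementID).innerHTML = this.responseText;
        }
    };
    xhttp.open("GET", proxyUrl, true);
    xhttp.send();
}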

The result was something like this, and it looks the same all the way down (still \\n and \\t, not the real content). I guess the problem is that the flashscores website uses JavaScript to load the data, and allorigins.me did not "wait" until the whole page had loaded. Here is another look, where it seems the content is being loaded with JavaScript.

The desired result is the starting elevens of both teams (Allonso M., Arrizabalaga K., Azpilicueta C.,...). I inspected the page and found that every name is inside an HTML tag: <div class="name">PLAYER'S NAME HERE</div>.

Any idea how to avoid both problems at once?

  1. CORS restriction
  2. The delay before the page is "filled" with data by JavaScript

I am trying to use client-side languages (no PHP).

Thank you :)

There are a few problems with your question:

  1. CORS is a restriction that browsers enforce on scripts running inside a page. The data you want is public, so if you fetch it from outside a browser (for example, from Node.js), there is nothing to avoid.
  2. The problem is not "waiting" until the page loads; the problem is that the data is filled in by the page's own scripts, so you need to run those scripts yourself.
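
As a side note on the escaped \\n and \\t in the question's output: that pattern is what a JSON-encoded string looks like when it is inserted into the page without being parsed. Below is a minimal sketch, assuming the proxy's `get` endpoint returns a JSON object with the page source in a `contents` field (an assumption about the proxy, not something the question confirms). `unwrapProxyResponse` and `sampleResponse` are hypothetical names, and the sample data is made up.

```javascript
// Hypothetical helper: unwrap a proxy response that is ASSUMED to look like
// {"contents": "<html>...</html>"} and return the raw page source.
function unwrapProxyResponse(responseText) {
  // JSON.parse turns escaped sequences such as \n and \t back into real whitespace.
  const payload = JSON.parse(responseText);
  if (typeof payload.contents !== "string") {
    throw new Error("Unexpected proxy response: no string 'contents' field");
  }
  return payload.contents;
}

// Made-up sample standing in for what the proxy might return.
const sampleResponse = JSON.stringify({
  contents: '<div id="lineups">\n\t<!-- filled in later by the page scripts -->\n</div>',
});

const html = unwrapProxyResponse(sampleResponse);
console.log(html.includes("\n"));           // whether real newlines are present after parsing
console.log(html.includes('class="name"')); // whether any player-name elements are in the static source
```

Unwrapping the JSON only fixes the escaping. The player names are still missing from the static source, which is point 2 above: they exist only after the page's scripts have run.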

I recommend using something like jsdom with Node.js for this task; it should be quite simple.
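
A rough sketch of that approach, assuming jsdom is installed (`npm install jsdom`) and that the page's scripts run acceptably inside jsdom, which is not guaranteed for a script-heavy site. The `div.name` selector comes from the question; `extractNames`, `scrapeLineups`, the polling interval, and the timeout are illustrative choices, not part of jsdom.

```javascript
// Collect the trimmed text of every <div class="name"> in a document.
// Works with any object exposing querySelectorAll (a jsdom or browser document).
function extractNames(document) {
  return Array.from(document.querySelectorAll("div.name"))
    .map((el) => el.textContent.trim())
    .filter((name) => name.length > 0);
}

// Load the page in jsdom, let its scripts run, and poll until names appear.
async function scrapeLineups(url, timeoutMs = 15000) {
  // Required lazily so extractNames stays usable without jsdom installed.
  const { JSDOM } = require("jsdom");
  const dom = await JSDOM.fromURL(url, {
    runScripts: "dangerously", // execute the page's own scripts (only for pages you trust)
    resources: "usable",       // also fetch external <script src="..."> files
  });
  try {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
      const names = extractNames(dom.window.document);
      if (names.length > 0) return names;
      // Give the page's scripts a little more time before checking again.
      await new Promise((resolve) => setTimeout(resolve, 250));
    }
    return []; // nothing rendered before the timeout
  } finally {
    dom.window.close(); // stop the page's timers and pending work
  }
}
```

It could be called as `scrapeLineups(matchUrl).then(console.log)`, where `matchUrl` is the match page from the question. Because this runs in Node.js rather than in a browser, no CORS proxy is involved.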

A great blog post about web scraping with Node.js (without script execution): here

Official jsdom npm page: here

Good luck!

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 