I am trying to gather line-ups from football/soccer match reports. I decided to scrape the data from a reports provider, but their pages load their content with JavaScript.
To be more specific, let's take this link to a flashscores.co.uk match.
First, they restrict CORS, so I used allorigins.me to get around it, together with this code:
function readurl(url, elementID) {
    var url = "http://allorigins.me/get?url=" + encodeURIComponent(url) + "&callback=?";
    var xhttp = new XMLHttpRequest();
    xhttp.onreadystatechange = function() {
        if (this.readyState == 4 && this.status == 200) {
            document.getElementById(elementID).innerHTML = this.responseText;
        }
    };
    xhttp.open("GET", url, true);
    xhttp.send();
}
The result was something like this, and it looks the same all the way down (still \\n and \\t, not the real content). I guess the problem is that the flashscores website uses JavaScript to load the data, and allorigins.me did not "wait" until the whole page had loaded. Here is another look, where it seems that the content is being loaded with JavaScript.
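One possible reading of the escaped \n and \t output, offered as an assumption rather than a confirmed diagnosis: the proxy's /get endpoint may wrap the fetched page in a JSON object, so responseText would be JSON text with escaped whitespace rather than HTML. The sketch below assumes the page source sits in a `contents` field; the helper name `unwrapProxyResponse` is hypothetical.

```javascript
// Assumption: the proxy replies with JSON shaped like
// {"contents": "<html>...</html>", "status": {...}}.
function unwrapProxyResponse(responseText) {
  const payload = JSON.parse(responseText); // throws on non-JSON input
  if (!payload || typeof payload.contents !== "string") {
    throw new Error("Proxy response has no string 'contents' field");
  }
  return payload.contents; // escaped \n and \t are real whitespace now
}

// Hand-written example response, not real proxy output.
const sample = JSON.stringify({ contents: "<div>\n\t<p>hi</p>\n</div>" });
console.log(unwrapProxyResponse(sample));
```

Note that "callback=?" is a placeholder that jQuery fills in for JSONP requests; a plain XMLHttpRequest sends it literally, so it probably needs to be dropped before the response parses as JSON. Even after unwrapping, the markup is only what the server sent, so anything the page adds later with JavaScript would still be missing.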
The desired result is to gather the starting elevens of both teams (Alonso M., Arrizabalaga K., Azpilicueta C., ...). I inspected the website and found that every name is inside an HTML tag: <div class="name">PLAYER'S NAME HERE</div>.
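Once the rendered markup is available, reading the names could look roughly like the sketch below. It relies only on the div.name selector described above; the function name `collectPlayerNames` is made up for illustration, and the sketch does not try to separate starters from substitutes or coaches if they share the same class.

```javascript
// Collect the text of every <div class="name"> under `root`.
// `root` can be `document` or any element; it only needs querySelectorAll.
function collectPlayerNames(root) {
  return Array.from(root.querySelectorAll("div.name"), (el) =>
    el.textContent.trim()
  ).filter((name) => name.length > 0); // skip empty placeholders
}

// In a browser console, once the line-ups are visible on the page:
// console.log(collectPlayerNames(document));
```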
Any idea how to avoid both problems at once?
I am trying to use client-side languages (no PHP).
Thank you :)
There are a few problems with your question:
I recommend using something like jsdom with Node.js for this task; it should be quite simple.
A great blog post about web scraping with Node.js (without script execution): here
Official jsdom npm page: here
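As a rough illustration of the jsdom route, here is a minimal sketch that parses an HTML string in Node.js and reads the div.name elements. It assumes jsdom has been installed with npm; the names `parseHtml`, `namesFrom` and `sampleHtml` are hypothetical, and the markup is hand-written rather than taken from the real site.

```javascript
// Assumes jsdom is installed (npm install jsdom). It is required inside the
// function so that a missing package only fails when the function is called.
function parseHtml(html) {
  const { JSDOM } = require("jsdom");
  // By default jsdom does not execute <script> tags in the parsed markup.
  return new JSDOM(html).window.document;
}

// Hand-written markup standing in for a fetched match page.
const sampleHtml =
  '<div class="name">Player A</div><div class="name">Player B</div>';

// Return the trimmed text of every <div class="name"> in the given HTML.
function namesFrom(html) {
  const nodes = parseHtml(html).querySelectorAll("div.name");
  return Array.from(nodes, (el) => el.textContent.trim());
}
```

Calling namesFrom(sampleHtml) should give the two placeholder names. For a page that builds its content with scripts, jsdom also accepts the constructor options runScripts: "dangerously" and resources: "usable", but whether that is enough for this particular site is untested here.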
Good luck!