
Scrapy extract data from dynamic table

I am trying to pull all the TD values from the "table-main" table on this page: http://www.oddsportal.com/basketball/usa/nba/results/

I am using Scrapy and Python 2.7.

From Scrapy Shell I can get the table via:

response.xpath('//*[@id="tournamentTable"]')

But I cannot get any of the TR or TD elements of that table. Both

response.xpath('//*[@id="tournamentTable"]/tbody')

and

response.xpath('//*[@id="tournamentTable"]/tbody/tr')

return an empty list. I suspect the table is created dynamically. Can anyone please help me scrape all the team names, scores, and odds from that table? I have been stuck on this for a while.

This question is different from the one suggested as a duplicate, "Scrapy not finding table": that question is about getting the table itself, whereas this one is about getting the data inside the table.

Yes, the results are loaded with an additional call to the website API. In this case the request is made to http://fb.oddsportal.com/ajax-sport-country-tournament-archive/3/MmbLsWh8/X0/1/-1/1/?_=1446338252826 .

I'm not sure you can hardcode the URL in your spider, since at least the 3 and MmbLsWh8 parts of the URL come from a script tag on the main page (the sid and id values passed to PageTournament):

<script type="text/javascript">
    //<![CDATA[
    var op = new OpHandler();if(!page)var page = new PageTournament({"id":"MmbLsWh8","sid":3,"cid":200,"archive":true});var menu_open = null;vJs();op.init();if(page && page.display)page.display();    var sigEndPage = true;
    try
    {
        if (sigEndJs)
        {
            globals.onPageReady();
        }
    } catch (e)
    {
    }

    //]]>
</script>
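A minimal sketch of pulling those values out of the page source, assuming the PageTournament argument stays a flat JSON object as in the snippet above. The function name extract_tournament_params and the regular expression are illustrative, not part of the site or of Scrapy:

```python
import json
import re

# Matches the JSON object literal passed to `new PageTournament(...)`.
# Assumption: the object is flat (no nested braces), as in the snippet above.
PAGE_TOURNAMENT_RE = re.compile(r"new PageTournament\((\{.*?\})\)")


def extract_tournament_params(page_html):
    """Return the PageTournament arguments as a dict, or None if not found."""
    match = PAGE_TOURNAMENT_RE.search(page_html)
    if match is None:
        return None
    return json.loads(match.group(1))


# Fragment copied from the script tag shown above.
sample = (
    'var op = new OpHandler();if(!page)var page = new PageTournament('
    '{"id":"MmbLsWh8","sid":3,"cid":200,"archive":true});var menu_open = null;'
)
params = extract_tournament_params(sample)
print(params["sid"], params["id"])
```

In a spider, the same function could be applied to the decoded body of the results page before building the AJAX request.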

Plus, there is a _ parameter that looks like a timestamp.
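As a rough sketch, the URL could be assembled from those pieces like this. It assumes the _ value is a millisecond Unix timestamp (the value in the URL above has the right length for one) and copies the X0/1/-1/1 segment verbatim, since its meaning is not known here. The name build_archive_url is hypothetical:

```python
import time

# The "X0/1/-1/1" segment is copied as-is from the observed request;
# what each part of it means is not known here.
AJAX_URL_TEMPLATE = (
    "http://fb.oddsportal.com/ajax-sport-country-tournament-archive/"
    "{sid}/{tournament_id}/X0/1/-1/1/?_={timestamp}"
)


def build_archive_url(sid, tournament_id, timestamp_ms=None):
    """Build the AJAX URL; defaults to the current time in milliseconds."""
    if timestamp_ms is None:
        # Assumption: "_" is a millisecond timestamp used as a cache-buster.
        timestamp_ms = int(time.time() * 1000)
    return AJAX_URL_TEMPLATE.format(
        sid=sid, tournament_id=tournament_id, timestamp=timestamp_ms
    )


url = build_archive_url(3, "MmbLsWh8", timestamp_ms=1446338252826)
print(url)
```

Whether the server actually checks the _ value is not established here; it may accept any number.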

The call to this AJAX URL returns a JSONP response with the HTML of the NBA results inside. You need to extract the HTML from the response (with a regular expression, for instance), feed it to a Selector, and extract the results. Some sample code from the shell to get you started:

$ scrapy shell http://www.oddsportal.com/basketball/usa/nba/results/
In [1]: fetch("http://fb.oddsportal.com/ajax-sport-country-tournament-archive/3/MmbLsWh8/X0/1/-1/1/?_=1446338252826")
In [2]: import re
In [3]: import scrapy
In [4]: # capture the value of the "html" key from the JSONP payload
In [5]: pattern = re.compile(r'"html":"(.*?)"}', re.MULTILINE | re.DOTALL)
In [6]: # response.text is the decoded body (on Python 2 with an older Scrapy, response.body also works)
In [7]: selector = scrapy.Selector(text=pattern.search(response.text).group(1))
In [8]: # TODO: now use the selector to extract the desired data
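A regular expression on the raw body can leave JSON escape sequences (such as \") in the captured HTML. One alternative, sketched here for Python 3, is to decode the JSON part of the JSONP body and look up the "html" key. The sample payload below is invented for illustration: only the "html" key comes from the pattern above, while the callback name and the other keys are assumptions about the real response.

```python
import json
import re

# Greedy on purpose: grab everything from the first "{" to the last "}".
# Assumption: the JSONP wrapper has no braces outside the JSON object.
JSON_OBJECT_RE = re.compile(r"\{.*\}", re.DOTALL)


def find_html(node):
    """Depth-first search for the first string stored under an "html" key."""
    if isinstance(node, dict):
        if isinstance(node.get("html"), str):
            return node["html"]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_html(child)
        if found is not None:
            return found
    return None


def html_from_jsonp(body):
    """Return the decoded HTML from a JSONP body, or None if not found."""
    match = JSON_OBJECT_RE.search(body)
    if match is None:
        return None
    return find_html(json.loads(match.group(0)))


# Hypothetical payload shape; the real response may differ.
sample = 'callback({"s":1,"d":{"html":"<table><tr><td class=\\"name\\">A - B<\\/td><\\/tr><\\/table>"}});'
html = html_from_jsonp(sample)
print(html)
```

The returned string can then be passed to scrapy.Selector(text=html) and queried with XPath as usual.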
