
Trying to extract data from tables using Scrapy

I am using Python.org version 2.7 64 bit on Vista 64 bit. My current Scrapy code works well for extracting text, but I'm stuck on how to get data from tables on websites. I've looked online for answers but I'm still not sure. As an example, I would like to get the data contained in this table of Wayne Rooney's goalscoring stats:

http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney

The code I currently have is this:

from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.utils.markup import remove_tags
from scrapy.cmdline import execute
import re


class MySpider(Spider):
    name = "goals"  # spider names are case-sensitive; this must match the name passed to execute() below
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney"]

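    def parse(self, response):
        # Whitespace-normalized page title (yields a single result)
        titles = response.selector.xpath("normalize-space(//title)")
        for title in titles:

            # Collect every <p> element, join the markup, then strip the tags
            body = response.xpath("//p").extract()
            body2 = "".join(body)

            print remove_tags(body2).encode('utf-8')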

execute(['scrapy','crawl','goals'])

What syntax do I need to use in the xpath() statements to get tabular data?

Thanks

I looked at the page you linked, and I got all the rows of the tournament table you want using this XPath expression: '//table[@id="player-fixture"]//tr[td[@class="tournament"]]' .

I'll try to explain each part of this XPath expression:

  • //table[@id="player-fixture"] : selects the table whose id attribute is player-fixture, which you can confirm by inspecting the page.
  • //tr[td[@class="tournament"]] : selects every row that has a td child with class tournament, that is, the rows holding the information for each match.

You can also use just the shorter //tr[td[@class="tournament"]] XPath expression. However, I think the longer expression is more robust, because it states that you want all rows ( tr ) under one specific table whose id is unique ( player-fixture ).

Once you get all rows, you can loop over them to get all information you need from each row entry.

To scrape tabular data, you usually identify the table first, then loop over its rows. An HTML table like this one usually has this format:

<table id="thistable">
  <tr>
    <th>Header1</th>
    <th>Header2</th>
  </tr>
  <tr>
    <td>data1</td>
    <td>data2</td>
  </tr>
</table>

Here's an example of how to parse this fixture table:

from scrapy.spider import Spider
from scrapy.http import Request
from myproject.items import Fixture

class GoalSpider(Spider):
    name = "goal"
    allowed_domains = ["whoscored.com"]
    start_urls = (
        'http://www.whoscored.com/',
        )

    def parse(self, response):
        return Request(
            url="http://www.whoscored.com/Players/3859/Fixtures/Wayne-Rooney",
            callback=self.parse_fixtures
        )

    def parse_fixtures(self, response):
        sel = response.selector
        for tr in sel.css("table#player-fixture>tbody>tr"):
            item = Fixture()
            item['tournament'] = tr.xpath('td[@class="tournament"]/span/a/text()').extract()
            item['date'] = tr.xpath('td[@class="date"]/text()').extract()
            item['team_home'] = tr.xpath('td[@class="team home "]/a/text()').extract()
            yield item

First, I identify the data rows with sel.css("table#player-fixture>tbody>tr") and loop over the results, then extract data.

Edit: items.py ( http://doc.scrapy.org/en/latest/topics/items.html )

from scrapy.item import Item, Field

class Fixture(Item):
    # One field per column extracted in parse_fixtures
    tournament = Field()
    date = Field()
    team_home = Field()

First of all, for each symbol you want, you need to know the name associated with it. For example, for goals I saw a <span> element whose title attribute equals "Goal", and for assists a <span> element whose title attribute equals "Assist".

With this information, you can check whether each retrieved row contains a span whose title matches the symbol you want to retrieve.

To get all the goal symbols in a row, you can evaluate the row with the relative expression .//span[@title="Goal"], as below:

for row in response.selector.xpath(
            '//table[@id="player-fixture"]//tr[td[@class="tournament"]]'):
    # Does this row contain any goal symbols?
    # The leading "." keeps the search inside this row; a bare "//" would search the whole page.
    list_of_goals = row.xpath('.//span[@title="Goal"]')
    if list_of_goals:
        # Output the number of goals in this row.
        print(len(list_of_goals))

If a non-empty list is returned, the row contains goal symbols. The number of goals in that row is the length of the returned list of spans.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 