
Python: Scrapy spider doesn't return results?

I know I need to work on my selectors to home in on more specific data, but I don't know why my CSV is empty.

My spider class:

class MySpider(BaseSpider):
    name =  "wikipedia"
    allowed_domains = ["en.wikipedia.org/"]
    start_urls = ["http://en.wikipedia.org/wiki/2014_in_film"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select('//table[@class="wikitable sortable jquery-tablesorter"], [@style="margin:auto; margin:auto;"]')
        items = []
        for title in titles:
            item = WikipediaItem()
            item["title"] = title.select("td/text()").extract()
            item["url"] = title.select("a/text()").extract()
            items.append(item)
        return items

The HTML I'm trying to crawl:

<table class="wikitable sortable" style="margin:auto; margin:auto;">
<caption>Highest-grossing films of 2014</caption>
<tr>
<th>Rank</th>
<th>Title</th>
<th>Studio</th>
<th>Worldwide gross</th>
</tr>
<tr>
<th style="text-align:center;">1</th>
<td><i><a href="/wiki/Transformers:_Age_of_Extinction" title="Transformers: Age of Extinction">Transformers: Age of Extinction</a></i></td>
<td><a href="/wiki/Paramount_Pictures" title="Paramount Pictures">Paramount Pictures</a></td>
<td>$1,091,404,499</td>
</tr>

This section of the HTML repeats for each film, so a correct selector should grab all of them:

    <tr>
    <th style="text-align:center;">1</th>
    <td><i><a href="/wiki/Transformers:_Age_of_Extinction" title="Transformers: Age of Extinction">Transformers: Age of Extinction</a></i></td>
    <td><a href="/wiki/Paramount_Pictures" title="Paramount Pictures">Paramount Pictures</a></td>
    <td>$1,091,404,499</td>
    </tr>

I know the issue isn't in the export step, because even in my shell it says "Crawl 0 pages, Scraped 0 Items", so nothing is being scraped at all.

  1. The table is not the repeating element; the table row is.

  2. Change your code to select the table rows, i.e.

     titles = hxs.select('//tr')

  3. Then loop through the rows and use XPath expressions relative to each row to get your data:

     for title in titles:
         item = WikipediaItem()
         # .extract() returns a list of strings; it is empty for rows
         # without a film link, such as the header row
         item["title"] = title.xpath("./td/i/a/@title").extract()
         item["url"] = title.xpath("./td/i/a/@href").extract()
         items.append(item)
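The row-by-row approach can be checked outside Scrapy. The sketch below is only an illustration: it uses the standard library's `xml.etree.ElementTree` as a stand-in for Scrapy's selectors, and it assumes the table from the question has been trimmed and closed so that it parses as XML. The function name `extract_films` is made up for this example. Note also that the markup in the question has `class="wikitable sortable"`, so an exact match on `"wikitable sortable jquery-tablesorter"` would not select it. The extra class is presumably added by JavaScript in the browser.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the table from the question, closed so it parses as XML.
HTML = """
<table class="wikitable sortable" style="margin:auto; margin:auto;">
<caption>Highest-grossing films of 2014</caption>
<tr>
<th>Rank</th>
<th>Title</th>
<th>Studio</th>
<th>Worldwide gross</th>
</tr>
<tr>
<th style="text-align:center;">1</th>
<td><i><a href="/wiki/Transformers:_Age_of_Extinction" title="Transformers: Age of Extinction">Transformers: Age of Extinction</a></i></td>
<td><a href="/wiki/Paramount_Pictures" title="Paramount Pictures">Paramount Pictures</a></td>
<td>$1,091,404,499</td>
</tr>
</table>
"""


def extract_films(markup):
    """Return one dict per table row that contains a film link."""
    root = ET.fromstring(markup.strip())
    films = []
    for row in root.iter("tr"):
        # Path relative to the row: first <td>/<i>/<a> inside it.
        link = row.find("./td/i/a")
        if link is None:
            continue  # header row: only <th> cells, no film link
        films.append({"title": link.get("title"), "url": link.get("href")})
    return films


films = extract_films(HTML)
print(films)
```

The header row is skipped because it has no `td/i/a` descendant. The same guard matters in the spider, where indexing an empty result with `[0]` would raise an `IndexError`.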

