
Index error with JavaScript parser

I am using Scrapy and the JavaScript parsing module 'slimit' to look for a particular JavaScript item within pages that I am crawling, like so:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


def get_fields(data):
    parser = Parser()
    tree = parser.parse(data)
    return {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
            for node in nodevisitor.visit(tree)
            if isinstance(node, ast.Assign)}


class ExampleSpider(CrawlSpider):
    name = "goal2"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/"]


    rules = [Rule(SgmlLinkExtractor(allow=(''), deny=('')), callback='parse_item')]

    def parse_item(self, response):
        sel = Selector(response)
        script = sel.xpath('//div[@id="team-stage-stats"]/following-sibling::script/text()')
        if script is not None:
            script = script.extract()[0]

This works fine as long as the item is found on the page being crawled. If it isn't, I get an error saying the list index is out of range. I thought the 'is not None' check would handle this, but apparently it does not.

Can anyone see what I am doing wrong?

Thanks

It's likely that your xpath call is returning an empty list, not None. Changing your check to

if script is not None and len(script) > 0:  

should fix the issue. Or, more simply, you could rely on truthiness with just

if script:

Since both None and [] are falsy values, this does the same thing as its longer counterpart.
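To see why the original check passes and the indexing still fails, here is a minimal sketch in plain Python (no Scrapy required) that simulates an XPath query with no matches:

```python
# Simulate what an XPath query returns when nothing matches: an empty list.
script = []

# The original check passes, because an empty list is not None...
assert script is not None

# ...but indexing it still raises IndexError.
try:
    script[0]
except IndexError:
    pass  # this is exactly the "list index out of range" error

# Truthiness covers both failure modes at once:
assert not script   # [] is falsy
assert not None     # None is falsy too

if script:          # so this branch is safely skipped when nothing matched
    first = script[0]
```

As a side note, newer Scrapy versions offer `response.xpath(...).extract_first()` (and later `.get()`), which return None when nothing matches instead of raising, so the manual length check can often be avoided entirely.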
