if statement not working for spider in scrapy

I am a Python/Scrapy newbie. For practice, I am trying to scrape a website: I want to pull all the companies that are active and write them to a CSV file. My code is pasted below. I added an if statement, but it doesn't seem to be working, and I am not sure what I am doing wrong.

Also I think the spider is crawling the website multiple times based on its output. I only want it to crawl the site once every time I run it.

Just an FYI: I did search Stack Overflow for an answer and found a few solutions, but I couldn't get any of them to work. I guess this is part of being a rookie.

from scrapy.spider import Spider
from scrapy.selector import Selector
from bizzy.items import BizzyItem

class SunSpider(Spider):
    name = "Sun"
    allowed_domains = ['sunbiz.org']
    start_urls = [
        'http://search.sunbiz.org/Inquiry/CorporationSearch/SearchResults/EntityName/a/Page1',
    ]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tbody/tr')
        items = []
        for site in sites:
            item = BizzyItem()
            item["company"] = sel.xpath('//td[1]/a/text()').extract()
            item["status"] = sel.xpath('//td[3]/text()').extract()
            if item["status"] != 'Active':
                pass
            else:
                items.append(item)
        return items

Crawling Multiple Times?

I've had time now to read over your code and glance at the source code for the site you are trying to scrape. First of all, I can tell you from my admittedly limited experience with Scrapy that your spider is not crawling the website multiple times. What you are experiencing is simply the nightmarish wall of debugging output the scrapy devs decided it was a good idea to spew by default. :)

It's actually very useful information if you read through it, and once you learn to spot the patterns you can almost read it as it whizzes by. I believe Scrapy properly logs to stderr, so if you are in a Unix-y environment you can always silence it with scrapy crawl myspider -o output.json -t json 2>/dev/null (IIRC).

Mysterious if Statement

Because extract() operates over selectors that may well match multiple elements, it returns a list. If you were to print your result, even though your XPath selects all the way down to text(), you would find it looks like this:

[u'string']  # Note the brackets
#^ no little u if you are running this with Python 3.x
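
To see why the comparison in the question never matches, here is a minimal sketch in plain Python. The list below is only a stand-in for what extract() returns (an assumption for illustration, so no Scrapy is needed to run it):

```python
# A plain list stands in for the value returned by extract().
extracted = ['Active']

# Comparing the whole list with a string is never equal...
list_matches = (extracted == 'Active')

# ...while comparing its first element behaves as intended.
first_matches = (extracted[0] == 'Active')

print(list_matches, first_matches)  # prints: False True
```

Because the list never equals the string, a != test against it is always true, whatever the list contains.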

You want the first element (the only member) of that list, [0]. Fortunately, you can add it right onto the method chain you have already constructed with extract():

item["company"] = sel.xpath('//td[1]/a/text()').extract()[0]
item["status"] = sel.xpath('//td[3]/text()').extract()[0]

Then (assuming your xpath is correct - I didn't check it), your conditional should behave as expected. (A list of any size will never equal a string, so the != test is always true and you always hit pass.)
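
One thing worth checking on the xpath side: an expression that starts with // searches the whole document, so inside the loop every iteration would read the same first match rather than the current row's cells. In Scrapy the usual remedy is to call xpath on the row selector (site) with a relative path such as td[1]/a/text(). The sketch below illustrates the same idea with the standard library's xml.etree.ElementTree as a stand-in, so it runs without Scrapy. The sample markup and the function name active_companies are hypothetical, not taken from the real site:

```python
import xml.etree.ElementTree as ET

# Hypothetical two-row table, loosely shaped like a search results page.
SAMPLE = """
<table>
  <tbody>
    <tr><td><a href="#">Acme Inc</a></td><td>P01</td><td>Active</td></tr>
    <tr><td><a href="#">Old Co</a></td><td>P02</td><td>INACT</td></tr>
  </tbody>
</table>
"""

def active_companies(markup):
    """Return the names of the rows whose third cell reads 'Active'."""
    root = ET.fromstring(markup)
    results = []
    for row in root.findall('.//tbody/tr'):
        # Paths are relative to `row`, so each iteration reads its own cells.
        link = row.find('td[1]/a')
        status = row.find('td[3]')
        if link is None or status is None:
            continue  # skip rows that lack the expected cells
        # Strip whitespace before comparing, in case the cell text is padded.
        if (status.text or '').strip() == 'Active':
            results.append((link.text or '').strip())
    return results

print(active_companies(SAMPLE))  # prints: ['Acme Inc']
```

The None checks play the role that guarding against an empty list would play with extract()[0], which raises IndexError when nothing matched.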
