Basic Scrapy Spider with Authentication

I'm new to python, scrapy, and everything that's not shell scripting.

That said, I'm trying to write a scraper that pulls customer information from my Etsy store.

So far I've written:

from scrapy.spiders import BaseSpider, CrawlSpider, Rule
from scrapy.http import FormRequest
from loginform import fill_login_form
from scrapy.selector import HtmlXPathSelector
from scrapy.linkextractors.sgml import SgmlLinkExtractor

class LoginSpider(BaseSpider):
    name = "etsy"
    allowed_domains = ["etsy.com"]
    start_urls = ["https://www.etsy.com/signin"]
    login_user = "myuname"
    login_pass = "mypass"

    rules = (
        Rule(SgmlLinkExtractor(allow=("/your/orders/sold",)),
             callback="parse_items", follow=True),
    )

    def parse(self, response):
        args, url, method = fill_login_form(response.url, response.body,
                                            self.login_user, self.login_pass)
        return FormRequest(url, method=method, formdata=args,
                           callback=self.parse_item)

    def after_login(self, response):
        if "avorites" in response.body:
            print 'logged in'
        else:
            print 'not logged in'
        return

    def parse_item(self, response):
        # TBD
        pass
The problem I'm having is that anything I put in parse_item only parses the first page after login, and nothing after that.

I'm pretty sure I'm missing something really basic, but none of the examples out there show how to structure an authenticated crawl: authenticate first, then crawl. I can follow tutorials and do each of those pieces independently; I'm just not sure how to put them together so the spider logs in and then browses all pages under /your/orders/sold.

Even if someone could point me to a working example, that would be extremely useful.

The rule only assigns URLs whose path matches '/your/orders/sold' to parse_item. Subsequent pages will not be parsed at all if they don't match this rule.

There are two possibilities:

  1. Set follow=False in your rule, then manually extract links and create Request objects in the parse_item function.

  2. Refine the allow parameter with more terms so that it also matches the other URLs you want to crawl.
