
Scrapy not returning all the items it should

I'm trying to get Scrapy to crawl through a website, but constrain it only to pages that match a certain pattern, and it's giving me a headache.

The website is structured like this:

website.com/category/page1/
website.com/category/page2/
website.com/category/page3/

And so on.

I need it to start crawling from the category page and then follow all the links that lead to other pages (there are 375 pages in total, and the number is not fixed, of course).

The problem is that it crawls through ~10 pages before I stop it, but it only returns 10-15 items, when there should be 200+.

Here is my code, which doesn't work right:

from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule

# Website is the project's Item subclass, imported from the project's items module


class WSSpider(CrawlSpider):
    name = "ws"
    allowed_domains = ["website.com"]
    start_urls = ["https://www.website.com/category/"]
    rules = (
        # Follow pagination links; parse every page that matches the pattern.
        Rule(LinkExtractor(allow=("/level_one/page*",)), callback="parse_product", follow=True),
    )

    def parse_product(self, response):
        sel = Selector(response)
        sites = sel.css(".pb-infos")
        items = []

        for site in sites:
            item = Website()
            item["brand"] = site.css(".pb-name .pb-mname::text").extract()
            item["referinta"] = site.css(".pb-name a::text").extract()
            item["disponibilitate"] = site.css(".pb-availability::text").extract()
            item["pret_vechi"] = site.css(".pb-sell .pb-old::text").extract()
            item["pret"] = site.css(".pb-sell .pb-price::text").extract()
            item["procent"] = site.css(".pb-sell .pb-savings::text").extract()
            items.append(item)

        #return items
        f = open("output.csv", "w")  # reopened on every call; the "w" mode turns out to be the bug
        for item in items:
            line = (
                item["brand"][0].strip(), ";",
                item["referinta"][-1].strip(), ";",
                item["disponibilitate"][0].strip(), ";",
                item["pret_vechi"][0].strip().strip(" lei"), ";",
                item["pret"][0].strip().strip(" lei"), ";",
                item["procent"][0].strip().strip("Mai ieftin cu "), "\n",
            )
            f.write("".join(line))
        f.close()

Any help is much appreciated!

I found my (stupid) mistake.

f = open("output.csv", "w")

should in fact be

f = open("output.csv", "a")

I once wrote a Python scraper to download an internal wiki site before it closed, and ran into the problem that our intranet (or the wiki server) was throttling my script's access to the content. There is a way of telling Scrapy to crawl more slowly.
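
If memory serves, the relevant knobs are Scrapy's download delay and AutoThrottle settings; a sketch with illustrative values, to go in the project's settings.py:

# settings.py -- illustrative values, tune for the target site
DOWNLOAD_DELAY = 2            # wait 2 seconds between consecutive requests
AUTOTHROTTLE_ENABLED = True   # adapt the delay to the server's response times
AUTOTHROTTLE_START_DELAY = 1  # initial delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 10   # upper bound on the adaptive delay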

The other problem I had was with authentication - some parts of the wiki required a login before they could be read.
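
The usual Scrapy pattern for that is FormRequest.from_response, which submits the site's login form before the real crawl starts; a minimal sketch, with a placeholder URL and placeholder credentials:

import scrapy

class LoginSpider(scrapy.Spider):
    name = "login_example"
    start_urls = ["https://example.com/login"]  # placeholder login page

    def parse(self, response):
        # Locate the login form on the page and submit it pre-filled.
        return scrapy.FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "secret"},  # placeholders
            callback=self.after_login,
        )

    def after_login(self, response):
        # Crude failure check; adapt to whatever the site shows on a bad login.
        if b"login failed" in response.body.lower():
            self.logger.error("Login failed")
            return
        # Authenticated from here on; yield requests for the protected pages.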

And the other problem in your case was going to be that you are overwriting output.csv every time parse_product runs...

parse_product is not called just once - Scrapy runs it asynchronously, once per response, so every call reopens and rewrites the file. Use CsvItemExporter instead: http://doc.scrapy.org/en/latest/topics/exporters.html#csvitemexporter
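
A sketch of what that looks like as an item pipeline (the file name and ";" delimiter just mirror the code above; enable the class via the ITEM_PIPELINES setting):

from scrapy.exporters import CsvItemExporter

class CsvExportPipeline:
    """Writes every scraped item into a single CSV file for the whole crawl."""

    def open_spider(self, spider):
        self.file = open("output.csv", "wb")  # the exporter expects a binary file
        self.exporter = CsvItemExporter(self.file, delimiter=";")
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

Simpler still, the built-in feed exports need no code at all: scrapy crawl ws -o output.csv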


 