Scrapy not returning all the items it should

I'm trying to get Scrapy to crawl through a website, but constrain it only to pages that match a certain pattern, and it's giving me a headache.

The website is structured like this:

website.com/category/page1/
website.com/category/page2/
website.com/category/page3/

And so on.

I need it to start crawling from the category page and then follow all the links that lead to another page (there are 375 pages total, and the number is not fixed, of course).

The problem is that it crawls through ~10 pages before I stop it, but it only returns 10-15 items, where there should be 200+.

Here is my code, which doesn't work right:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
# Website is the item class defined elsewhere in the project (items.py)

class WSSpider(CrawlSpider):
    name = "ws"
    allowed_domains = ["website.com"]
    start_urls = ["https://www.website.com/category/"]
    rules = (
        Rule(LinkExtractor(allow=("/level_one/page*",)), callback="parse_product", follow=True),
    )

    def parse_product(self, response):
        sel = Selector(response)
        sites = sel.css(".pb-infos")
        items = []

        for site in sites:
            item = Website()
            item["brand"] = site.css(".pb-name .pb-mname::text").extract()
            item["referinta"] = site.css(".pb-name a::text").extract()
            item["disponibilitate"] = site.css(".pb-availability::text").extract()
            item["pret_vechi"] = site.css(".pb-sell .pb-old::text").extract()
            item["pret"] = site.css(".pb-sell .pb-price::text").extract()
            item["procent"] = site.css(".pb-sell .pb-savings::text").extract()
            items.append(item)

        #return items
        f = open("output.csv", "w")
        for item in items:
            line = \
                item["brand"][0].strip(), ";", \
                item["referinta"][-1].strip(), ";", \
                item["disponibilitate"][0].strip(), ";", \
                item["pret_vechi"][0].strip().strip(" lei"), ";", \
                item["pret"][0].strip().strip(" lei"), ";", \
                item["procent"][0].strip().strip("Mai ieftin cu "), "\n"
            f.write("".join(line))
        f.close()

Any help is much appreciated!

I found my (stupid) mistake.

f = open("output.csv", "w")

should in fact be

f = open("output.csv", "a")
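For what it's worth, a minimal sketch of the more idiomatic alternative (assuming the same Website item class and the same selectors): yield the items from the callback and let Scrapy's feed export write the CSV, so the spider never opens the file itself and the write mode stops mattering.

    def parse_product(self, response):
        # yield one item per product block; Scrapy collects them across all crawled pages
        for site in response.css(".pb-infos"):
            item = Website()
            item["brand"] = site.css(".pb-name .pb-mname::text").extract()
            item["referinta"] = site.css(".pb-name a::text").extract()
            item["disponibilitate"] = site.css(".pb-availability::text").extract()
            item["pret_vechi"] = site.css(".pb-sell .pb-old::text").extract()
            item["pret"] = site.css(".pb-sell .pb-price::text").extract()
            item["procent"] = site.css(".pb-sell .pb-savings::text").extract()
            yield item

Running scrapy crawl ws -o output.csv then exports everything the spider yields into a single file.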

I once wrote a Python scraper to download an internal wiki site before it closed - I ran into a problem where our intranet or the wiki server was throttling my script's access to the content. I think there is a way of telling Scrapy to access more slowly.
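For the slowing-down part, a minimal sketch of the relevant Scrapy settings (the values are illustrative), placed in the project's settings.py:

    DOWNLOAD_DELAY = 2            # fixed pause between requests to the same site
    # or let Scrapy adapt the delay to the server's response times:
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1
    AUTOTHROTTLE_MAX_DELAY = 10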

The other problem I had was with authentication - some parts of the wiki required a login before they could be read.
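Scrapy can handle that kind of form login with FormRequest.from_response; here is a rough sketch, where the URLs and form field names are placeholders for whatever the real login page uses:

    import scrapy

    class WikiSpider(scrapy.Spider):
        name = "wiki"
        start_urls = ["https://wiki.example.com/login"]   # placeholder login page

        def parse(self, response):
            # fill in and submit the login form found on the page; the session
            # cookie is kept automatically for the requests that follow
            return scrapy.FormRequest.from_response(
                response,
                formdata={"username": "me", "password": "secret"},
                callback=self.after_login,
            )

        def after_login(self, response):
            # from here on, every request is made with the authenticated session
            yield scrapy.Request("https://wiki.example.com/SomePage", callback=self.parse_page)

        def parse_page(self, response):
            yield {"title": response.css("title::text").get()}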

And the other problem was going to be that you are overwriting output.csv every time...

parse_product is async, so use CsvItemExporter instead: http://doc.scrapy.org/en/latest/topics/exporters.html#csvitemexporter
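A rough sketch of that approach as an item pipeline, which opens the file once per crawl instead of once per parsed page (assuming a Scrapy version where the exporter lives in scrapy.exporters, and that the pipeline class is enabled in the ITEM_PIPELINES setting):

    from scrapy.exporters import CsvItemExporter

    class CsvExportPipeline(object):
        def open_spider(self, spider):
            # one file for the whole crawl, opened in binary mode for the exporter
            self.file = open("output.csv", "wb")
            self.exporter = CsvItemExporter(self.file)
            self.exporter.start_exporting()

        def process_item(self, item, spider):
            self.exporter.export_item(item)
            return item

        def close_spider(self, spider):
            self.exporter.finish_exporting()
            self.file.close()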
