
Assigning data from Scrapy spider to a variable

I am running a Scrapy spider in a script, and I want to assign the scraped data to a variable instead of outputting it to a file and then reading that file back to get the data.

Right now the spider outputs the data to a json file, then I read that data, arrange it the way I need, and delete the json file the spider produced (mainly because I don't know how to overwrite the spider's output). This works and does what I want, but it definitely feels like brute force. Is there a more efficient way to access the spider's data without having to output it to json first?


Here is my code:

import json
from collections import OrderedDict
from datetime import datetime
from operator import itemgetter
from pathlib import Path

import scrapy
from scrapy.crawler import CrawlerProcess


class SpiderManager:

    def __init__(self):
        self.run_spider()
        self.compile_json_data()

    @staticmethod
    def write_json(data, filename="quote_data.json"):
        """Write data to JSON file"""

        with open(filename, "w") as f:
            json.dump(data, f, indent=4)

    @staticmethod
    def read_json(filename="quote_data.json"):
        """Get data from JSON file"""
        try:
            with open(filename) as json_file:
                data = json.load(json_file)
        except FileNotFoundError:
            data = OrderedDict()
        except ValueError:
            data = []
        return data

    @staticmethod
    def compile_json_data(spider_file="quotes_spider.json"):
        """Read the data from the spider & created an OrderedDict"""

        spider_data = SpiderManager.read_json(spider_file)
        spider_data = sorted(spider_data, key=itemgetter("dob"))
        ordered_data = OrderedDict()
        for author_quote in spider_data:
            ordered_data.update({author_quote["author"]: author_quote["quote"]})

        SpiderManager.write_json(ordered_data, filename="quotes_dict.json")
        try:
            # remove the intermediate json file written by the spider;
            # parentheses ensure unlink() is called on the joined path
            (Path.cwd() / spider_file).unlink()
        except FileNotFoundError:
            pass

    def run_spider(self):
        """Run the spider"""
        process = CrawlerProcess({"FEED_FORMAT": "json",
                                  "FEED_URI": "quotes_spider.json",
                                  })
        process.crawl(MySpider)
        process.start()


class MySpider(scrapy.Spider):
    name = "quotes"
    temp_data = {}

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        quote_blocks = response.css("div.quote")
        for quote_block in quote_blocks:
            url = quote_block.css("a::attr(href)").get()
            quote = quote_block.css("span::text").get().strip()

            yield response.follow(url, self.parse_crossword,
                                  cb_kwargs=dict(quote=quote))

    def parse_crossword(self, response, quote):
        author = response.css("h3::text").get().strip()
        dob = response.css("span.author-born-date::text").get()
        dob = datetime.strptime(dob, "%B %d, %Y")

        yield {
            "author": author,
            "dob": dob,
            "quote": quote
        }

if __name__ == '__main__':
    SpiderManager()

Items & ItemPipelines 是我實現這一目標所需要的。 我在 SpiderManager 中創建了一個 class 變量,然后使用管道將 append 的每個項目傳遞給 class 變量。 下面是我的代碼,我添加了一個 Item class 和 Pipeline class,並指定了管道 CrawlerProcess

from scrapy import Item, Field


class SaveItemPipeline:
    """Append each scraped item to the list held on SpiderManager"""
    def process_item(self, item, spider):
        SpiderManager.spider_data.append(item)
        return item  # pass the item on to any later pipelines


class MyItem(Item):
    author = Field()
    dob = Field()
    quote = Field()


...


class SpiderManager:
    ...

    def run_spider(self):
        """Run the spider"""
        process = CrawlerProcess({
            "ITEM_PIPELINES": {SaveItemPipeline: 100},
        })
        process.crawl(MySpider)
        process.start()
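
For reference, here is a minimal sketch of how the elided pieces might fit together, assuming the spider_data class variable referenced by the pipeline above and a hypothetical compile_data helper standing in for the original compile_json_data (the helper's name and body are illustrative, not from the original answer):

import json
from collections import OrderedDict
from operator import itemgetter

from scrapy.crawler import CrawlerProcess


class SpiderManager:
    # class variable the pipeline appends scraped items to
    spider_data = []

    def __init__(self):
        self.run_spider()
        self.compile_data()

    @classmethod
    def compile_data(cls):
        """Sort the in-memory items by dob and write the
        author -> quote mapping, with no intermediate file
        to read back or delete."""
        items = sorted(cls.spider_data, key=itemgetter("dob"))
        ordered_data = OrderedDict(
            (item["author"], item["quote"]) for item in items
        )
        with open("quotes_dict.json", "w") as f:
            json.dump(ordered_data, f, indent=4)

    def run_spider(self):
        """Run the spider with the collecting pipeline enabled"""
        process = CrawlerProcess({
            "ITEM_PIPELINES": {SaveItemPipeline: 100},
        })
        process.crawl(MySpider)
        process.start()

After process.start() returns, SpiderManager.spider_data already holds every scraped item in memory, so write_json, read_json, and the file deletion in the original compile_json_data are no longer needed.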
