Exporting Scrapy output to a text file

Question

Is there a way to export Scrapy data to a text file, so that running the Python script generates the file without having to execute Scrapy from the terminal?

Code example

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class NameListSpider(CrawlSpider):
    name = 'namelist'
    allowed_domains = ['namelist.com']
    start_urls = ['http://www.namelist.com']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="post-outer"]/a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'name': response.xpath('//div[@class="alt"]/span/span[2]/text()').get()
        }

# I have added the lines below as an example
with open("file.txt", "a") as file:
    file.write(name)

Answer 1

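There is more than one way to achieve this.
If you want to run your project with scrapy crawl, you can configure feed exports in the settings.
If you want to run it with python your_python_script.py, you also need to pass the settings to the crawler yourself.
You can even export different items to different files. For that, take a look at this pipeline on GitHub.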
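For the scrapy crawl option, the feed configuration might look like the sketch below. It assumes Scrapy 2.1 or later, where the FEEDS setting is available, and reuses the file.txt name from the question. The "encoding" value is an illustrative choice, not something the question asked for.

```python
# settings.py (sketch): tell Scrapy's feed exporter where and how to write items.
# Assumes Scrapy >= 2.1, where the FEEDS dictionary is supported.
FEEDS = {
    # Output path (taken from the question) mapped to its export options.
    "file.txt": {
        "format": "csv",     # one row per scraped item
        "encoding": "utf8",  # assumption: UTF-8 output is wanted
    },
}
```

With this in place, scrapy crawl namelist should write the scraped items to file.txt without any extra command-line options.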

Now, to run your spider with python your_script.py, you would do something like this:

# -*- coding: utf-8 -*-
from scrapy.settings import Settings
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule, CrawlSpider

class NameListSpider(CrawlSpider):
    name = 'namelist'
    allowed_domains = ['namelist.com']
    start_urls = ['http://www.namelist.com']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//div[@class="post-outer"]/a'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        yield {
            'name': response.xpath('//div[@class="alt"]/span/span[2]/text()').get()
        }

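def get_settings():
    settings = Settings()
    # Write the scraped items to file.txt in CSV format.
    # Note: FEED_URI and FEED_FORMAT are deprecated since Scrapy 2.1
    # in favor of the FEEDS setting.
    settings.set('FEED_URI', 'file.txt')
    settings.set('FEED_FORMAT', 'csv')
    return settings

if __name__ == '__main__':
    settings = get_settings()
    # CrawlerRunner does not start the Twisted reactor itself,
    # so we run it manually and stop it once the crawl finishes.
    runner = CrawlerRunner(settings)
    d = runner.crawl(NameListSpider)
    d.addBoth(lambda _: reactor.stop())
    reactor.run()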
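If a plain text file with one name per line is wanted instead of CSV, an item pipeline is another option. The following is only a minimal sketch: TextFilePipeline and its path argument are hypothetical names, and it assumes the classic pipeline hooks open_spider, close_spider and process_item with a spider argument.

```python
class TextFilePipeline:
    """Hypothetical pipeline: appends each item's 'name' field to a text file."""

    def __init__(self, path="file.txt"):
        # Scrapy instantiates the pipeline without arguments, so the
        # default path (taken from the question) is what it would use.
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the file in append mode.
        self.file = open(self.path, "a", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes: close the file if it was opened.
        if self.file is not None:
            self.file.close()

    def process_item(self, item, spider):
        # Write the name on its own line; skip items where the XPath matched nothing.
        name = item.get("name")
        if name:
            self.file.write(name.strip() + "\n")
        # Return the item so any later pipelines still receive it.
        return item
```

To try it with the script above, the pipeline would presumably be registered in get_settings() with something like settings.set('ITEM_PIPELINES', {'__main__.TextFilePipeline': 300}). The '__main__' path assumes the class is defined in the script being run, and the FEED_* settings could then be dropped.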

Solution 1
3 Accepted 2020-05-19 13:13:03