Scrapy: storing scraped data in CSV files with dynamic file names

Question

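I am trying to scrape data from different URLs, and I want to save the data in a CSV file named after the domain of the URL being scraped.

For example, if I scrape data from https://www.example.com/event/abc, the saved file should be named example.com. The data is being scraped correctly, but I have not managed to save the file under the correct name.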

Code

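from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class myCrawler(CrawlSpider):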
    name = 'testing'
    rotate_user_agent = True
    base_url=''
    start_urls = []
    allowed_domains = ''
    handle_httpstatus_list = [404,403]

    custom_settings = {
            # in order to reduce the risk of getting blocked
            'DOWNLOADER_MIDDLEWARES': {'sitescrapper.sitescrapper.middlewares.RotateUserAgentMiddleware': 400,
                                       'sitescrapper.sitescrapper.middlewares.ProjectDownloaderMiddleware': 543, },
            'COOKIES_ENABLED': False,
            'CONCURRENT_REQUESTS': 6,
            'DOWNLOAD_DELAY': 2,
            'DEPTH_LIMIT' : 1,
            'CELERYD_MAX_TASKS_PER_CHILD' : 1,

            # Duplicates pipeline
            'ITEM_PIPELINES': {'sitescrapper.sitescrapper.pipelines.DuplicatesPipeline': 300},

            # In order to create a CSV file:
            'FEEDS': {'%(allowed_domains).csv': {'format': 'csv'}},
        }
    def __init__(self, category='', **kwargs):
        self.base_url = category
        self.allowed_domains = ['.'.join(urlparse(self.base_url).netloc.split('.')[-2:])]
        self.start_urls.append(self.base_url)
        print(f"Base url is {self.base_url} and allowed domain is {self.allowed_domains}")  

        self.rules = (
            Rule(
                LinkExtractor(allow_domains=self.allowed_domains),
                process_links=process_links,
                callback='parse_item',
                follow=True
            ),
        )   
        super().__init__(**kwargs)

Thanks in advance.

Answer 1

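You can specify the output location and set the file name dynamically by adding this to custom_settings:

'FEEDS': {"./scraped_urls/%(file_name)s" : {"format": "csv"}},

Scrapy replaces a %(file_name)s placeholder in the feed URI with the value of the spider attribute of the same name, so the spider needs a file_name attribute.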

Answer 2

What if you use the split("/") function on the URL?
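A rough sketch of that idea, using only the standard library. The helper names host_by_split, host_by_urlparse, and file_name_for are hypothetical, and keeping the last two host labels is an assumption carried over from the question's code.

```python
from urllib.parse import urlparse


def host_by_split(url: str) -> str:
    """Take the host by splitting on '/'; assumes the URL starts with 'scheme://'."""
    # "https://www.example.com/event/abc".split("/") gives
    # ["https:", "", "www.example.com", "event", "abc"], so index 2 is the host.
    return url.split("/")[2]


def host_by_urlparse(url: str) -> str:
    """Take the host with urlparse, which does not rely on a fixed index."""
    return urlparse(url).netloc


def file_name_for(url: str) -> str:
    """Build a CSV file name from the last two labels of the host."""
    domain = ".".join(host_by_urlparse(url).split(".")[-2:])
    return f"{domain}.csv"


print(file_name_for("https://www.example.com/event/abc"))  # example.com.csv
```

Both helpers return the same host for a URL like the one in the question; urlparse is likely the safer choice when the input might lack a scheme or contain a port.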
