
Scrapy Data Storing in csv files with dynamic file names

I am trying to scrape data from different urls, and I want to save the data in csv files whose filename is the top-level domain of the scraped url.

For example, if I am scraping data from https://www.example.com/event/abc then the saved file name should be example.com. The data itself is scraped correctly, but I have not been able to save the file with the proper filename.
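For reference, the intended mapping from url to filename can be produced with urllib.parse, which is the same approach the spider below takes:

from urllib.parse import urlparse

url = "https://www.example.com/event/abc"
# netloc is 'www.example.com'; keeping the last two labels gives 'example.com'
domain = '.'.join(urlparse(url).netloc.split('.')[-2:])
print(domain)  # example.com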

Code

from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class myCrawler(CrawlSpider):
    name = 'testing'
    rotate_user_agent = True
    base_url=''
    start_urls = []
    allowed_domains = ''
    handle_httpstatus_list = [404,403]

    custom_settings = {
            # in order to reduce the risk of getting blocked
            'DOWNLOADER_MIDDLEWARES': {'sitescrapper.sitescrapper.middlewares.RotateUserAgentMiddleware': 400,
                                       'sitescrapper.sitescrapper.middlewares.ProjectDownloaderMiddleware': 543, },
            'COOKIES_ENABLED': False,
            'CONCURRENT_REQUESTS': 6,
            'DOWNLOAD_DELAY': 2,
            'DEPTH_LIMIT' : 1,
            'CELERYD_MAX_TASKS_PER_CHILD' : 1,

            # Duplicates pipeline
            'ITEM_PIPELINES': {'sitescrapper.sitescrapper.pipelines.DuplicatesPipeline': 300},

            # In order to create a CSV file:
            'FEEDS': {'%(allowed_domains).csv': {'format': 'csv'}},
        }
    def __init__(self, category='', **kwargs):
        self.base_url = category
        self.allowed_domains = ['.'.join(urlparse(self.base_url).netloc.split('.')[-2:])]
        self.start_urls.append(self.base_url)
        print(f"Base url is {self.base_url} and allowed domain is {self.allowed_domains}")

        self.rules = (
            Rule(
                LinkExtractor(allow_domains=self.allowed_domains),
                process_links=process_links,
                callback='parse_item',
                follow=True
            ),
        )
        super().__init__(**kwargs)

Thanks in advance

We can specify the download location and set the filename dynamically by using

'FEEDS': {"./scraped_urls/%(file_name)s" : {"format": "csv"}},

in custom_settings.
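For the %(file_name)s placeholder to resolve, the spider needs a file_name attribute: besides the built-in %(name)s and %(time)s, Scrapy fills feed-URI placeholders from the spider attribute of the same name. A minimal sketch, assuming the same urlparse-based domain extraction as in the question (file_name here is our own attribute name, not a Scrapy built-in):

from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider


class MyCrawler(CrawlSpider):
    name = 'testing'

    custom_settings = {
        # %(file_name)s is replaced with the spider's file_name attribute
        # when the feed is created
        'FEEDS': {'./scraped_urls/%(file_name)s': {'format': 'csv'}},
    }

    def __init__(self, category='', **kwargs):
        self.base_url = category
        # e.g. https://www.example.com/event/abc -> example.com.csv
        domain = '.'.join(urlparse(self.base_url).netloc.split('.')[-2:])
        self.file_name = f"{domain}.csv"
        super().__init__(**kwargs)

This also suggests why the original '%(allowed_domains).csv' did not work: the printf-style placeholder is missing its trailing s (it would need to be '%(allowed_domains)s.csv'), and allowed_domains is a list, which does not format cleanly into a filename.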

Have you tried using the split("\") function on your url?
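Presumably this means splitting the url string directly instead of using urlparse; a quick sketch of that approach (splitting on "/", since "\" does not appear in urls):

url = "https://www.example.com/event/abc"
netloc = url.split("/")[2]                  # 'www.example.com'
domain = '.'.join(netloc.split('.')[-2:])   # 'example.com'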
