
Scrapy Data Storing in csv files with dynamic file names

I am trying to scrape data from different urls, and I want to save the data in csv files whose filename is the top-level domain of the scraped url.

For example, if I am scraping data from https://www.example.com/event/abc then the saved file name should be example.com. The data itself is scraped correctly, but I have not been able to save the file with the proper filename.
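For reference, the intended mapping from url to filename can be produced with urllib.parse, which is the same approach the spider below takes:

from urllib.parse import urlparse

url = "https://www.example.com/event/abc"
# netloc is 'www.example.com'; keeping the last two labels gives 'example.com'
domain = '.'.join(urlparse(url).netloc.split('.')[-2:])
print(domain)  # example.com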

Code

from urllib.parse import urlparse

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class myCrawler(CrawlSpider):
    name = 'testing'
    rotate_user_agent = True
    base_url=''
    start_urls = []
    allowed_domains = ''
    handle_httpstatus_list = [404,403]

    custom_settings = {
            # in order to reduce the risk of getting blocked
            'DOWNLOADER_MIDDLEWARES': {'sitescrapper.sitescrapper.middlewares.RotateUserAgentMiddleware': 400,
                                       'sitescrapper.sitescrapper.middlewares.ProjectDownloaderMiddleware': 543, },
            'COOKIES_ENABLED': False,
            'CONCURRENT_REQUESTS': 6,
            'DOWNLOAD_DELAY': 2,
            'DEPTH_LIMIT' : 1,
            'CELERYD_MAX_TASKS_PER_CHILD' : 1,

            # Duplicates pipeline
            'ITEM_PIPELINES': {'sitescrapper.sitescrapper.pipelines.DuplicatesPipeline': 300},

            # In order to create a CSV file:
            'FEEDS': {'%(allowed_domains).csv': {'format': 'csv'}},
        }
    def __init__(self, category='', **kwargs):
        self.base_url = category
        self.allowed_domains = ['.'.join(urlparse(self.base_url).netloc.split('.')[-2:])]
        self.start_urls.append(self.base_url)
        print(f"Base url is {self.base_url} and allowed domain is {self.allowed_domains}")

        self.rules = (
            Rule(
                LinkExtractor(allow_domains=self.allowed_domains),
                process_links=process_links,
                callback='parse_item',
                follow=True
            ),
        )
        super().__init__(**kwargs)

Thanks in advance

We can specify the download location and set the filename dynamically by using

'FEEDS': {"./scraped_urls/%(file_name)s" : {"format": "csv"}},

in custom_settings.
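For the %(file_name)s placeholder to resolve, the spider needs a file_name attribute: besides the built-in %(name)s and %(time)s, Scrapy fills feed-URI placeholders from the spider attribute of the same name. A minimal sketch, assuming the same urlparse-based domain extraction as in the question (file_name here is our own attribute name, not a Scrapy built-in):

from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider


class MyCrawler(CrawlSpider):
    name = 'testing'

    custom_settings = {
        # %(file_name)s is replaced with the spider's file_name attribute
        # when the feed is created
        'FEEDS': {'./scraped_urls/%(file_name)s': {'format': 'csv'}},
    }

    def __init__(self, category='', **kwargs):
        self.base_url = category
        # e.g. https://www.example.com/event/abc -> example.com.csv
        domain = '.'.join(urlparse(self.base_url).netloc.split('.')[-2:])
        self.file_name = f"{domain}.csv"
        super().__init__(**kwargs)

This also suggests why the original '%(allowed_domains).csv' did not work: the printf-style placeholder is missing its trailing s (it would need to be '%(allowed_domains)s.csv'), and allowed_domains is a list, which does not format cleanly into a filename.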

Have you tried using the split("\") function on your url?
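Presumably this means splitting the url string directly instead of using urlparse; a quick sketch of that approach (splitting on "/", since "\" does not appear in urls):

url = "https://www.example.com/event/abc"
netloc = url.split("/")[2]                  # 'www.example.com'
domain = '.'.join(netloc.split('.')[-2:])   # 'example.com'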
