Scrapy Data Storing in csv files with dynamic file names
I am trying to scrape data from different URLs, and I want to save the data in CSV files whose filename is the top-level domain of the scraped URL. For example, if I am scraping data from https://www.example.com/event/abc, then the saved file should be named example.com. The data is scraped correctly, but I have not been able to save the file with the proper filename.
Code
from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class myCrawler(CrawlSpider):
    name = 'testing'
    rotate_user_agent = True
    base_url = ''
    start_urls = []
    allowed_domains = ''
    handle_httpstatus_list = [404, 403]

    custom_settings = {
        # in order to reduce the risk of getting blocked
        'DOWNLOADER_MIDDLEWARES': {
            'sitescrapper.sitescrapper.middlewares.RotateUserAgentMiddleware': 400,
            'sitescrapper.sitescrapper.middlewares.ProjectDownloaderMiddleware': 543,
        },
        'COOKIES_ENABLED': False,
        'CONCURRENT_REQUESTS': 6,
        'DOWNLOAD_DELAY': 2,
        'DEPTH_LIMIT': 1,
        'CELERYD_MAX_TASKS_PER_CHILD': 1,
        # Duplicates pipeline
        'ITEM_PIPELINES': {'sitescrapper.sitescrapper.pipelines.DuplicatesPipeline': 300},
        # In order to create a CSV file:
        'FEEDS': {'%(allowed_domains).csv': {'format': 'csv'}},
    }

    def __init__(self, category='', **kwargs):
        self.base_url = category
        self.allowed_domains = ['.'.join(urlparse(self.base_url).netloc.split('.')[-2:])]
        self.start_urls.append(self.base_url)
        print(f"Base url is {self.base_url} and allowed domain is {self.allowed_domains}")
        self.rules = (
            Rule(
                LinkExtractor(allow_domains=self.allowed_domains),
                process_links=process_links,
                callback='parse_item',
                follow=True,
            ),
        )
        super().__init__(**kwargs)
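For reference, the domain-extraction expression used in `__init__` can be checked in isolation. This is a minimal sketch using only the standard library; the helper name `top_level_name` is mine, not part of the spider:

```python
from urllib.parse import urlparse

def top_level_name(url):
    """Return the last two labels of the URL's host, e.g. 'example.com'."""
    return '.'.join(urlparse(url).netloc.split('.')[-2:])

print(top_level_name("https://www.example.com/event/abc"))  # example.com
```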
Thanks in advance
We can specify the download location and set the filename dynamically by using

'FEEDS': {"./scraped_urls/%(file_name)s" : {"format": "csv"}},

in custom_settings.
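Scrapy resolves named placeholders in a feed URI from spider attributes of the same name, so assigning `self.file_name` in `__init__` makes the `%(file_name)s` placeholder produce the desired filename. The sketch below demonstrates only the printf-style substitution Scrapy performs; the attribute name `file_name` and the example URL are assumptions matching this answer:

```python
from urllib.parse import urlparse

# In the spider's __init__ you would set, e.g.:
#   self.file_name = '.'.join(urlparse(self.base_url).netloc.split('.')[-2:]) + '.csv'
# Scrapy then fills %(file_name)s in the feed URI from that attribute,
# equivalent to printf-style named interpolation:
feed_uri = "./scraped_urls/%(file_name)s"
params = {
    "file_name": '.'.join(
        urlparse("https://www.example.com/event/abc").netloc.split('.')[-2:]
    ) + '.csv',
}
print(feed_uri % params)  # ./scraped_urls/example.com.csv
```

Note that the original `'%(allowed_domains).csv'` key lacks the `s` conversion character, so the placeholder is never substituted correctly.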
Are you using the split("\") function on your url?