
Saving data to separate csv files in scrapy

I made a scraper for yellow pages. There is a categories.txt file that is read by the script and then it generates links according to those categories:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
categories = settings.get('CATEGORIES')

links = []
for category in categories:
    link = 'https://www.yellowpages.com/search?search_terms=' + category + '&geo_location_terms=NY&page=1'
    links.append(link)
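The question says categories.txt is read by the script into the CATEGORIES setting but doesn't show how. A minimal sketch of one way to load it (the helper name `load_categories` and the one-category-per-line file format are assumptions, not shown in the question):

```python
def load_categories(path):
    """Read one category per line, skipping blank lines and surrounding whitespace."""
    with open(path, encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]

# In settings.py this could then become (path assumed):
# CATEGORIES = load_categories('categories.txt')
```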

This links list is then passed to start_urls:

import csv

from pydispatch import dispatcher
from scrapy import signals
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class YpSpider(CrawlSpider):
    name = 'yp'
    allowed_domains = ['yellowpages.com']
    start_urls = links

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//a[@class="business-name"]'),
             callback='parse_item', follow=True),
        Rule(LinkExtractor(restrict_xpaths='//a[@class="next ajax-page"]'),
             follow=True),
    )

    def __init__(self, *a, **kw):
        super().__init__(*a, **kw)
        dispatcher.connect(self.spider_closed, signals.spider_closed)

It saves all the data from all the links in a CSV file named parent.csv. This parent.csv file has a column named keyword, which is used to separate the data from the different categories and write a separate CSV file for each of them. This is implemented in the spider_closed function:

def spider_closed(self, spider):
    with open('parent.csv', newline='') as file:
        reader = csv.reader(file)
        for row in reader:
            # row[0] is the keyword column; append the row to that category's file
            with open('{}.csv'.format(row[0]), 'a', newline='') as f:
                writer = csv.writer(f)
                writer.writerow(row)
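Reopening the per-category file in append mode for every single row works, but it is slow and appends duplicates if the spider runs twice. A sketch of a variant (the helper name `split_by_category` is made up) that opens each output file once in write mode and keeps one `csv.writer` per category:

```python
import csv

def split_by_category(parent_path='parent.csv'):
    """Split parent.csv into one CSV per value in the first (keyword) column.

    Sketch only: assumes the keyword sits in column 0, as in the question,
    and writes '<keyword>.csv' files into the current directory.
    """
    writers = {}   # category -> csv.writer
    handles = []   # keep file handles so they can be closed at the end
    with open(parent_path, newline='') as src:
        for row in csv.reader(src):
            if not row:
                continue
            category = row[0]
            if category not in writers:
                f = open('{}.csv'.format(category), 'w', newline='')
                handles.append(f)
                writers[category] = csv.writer(f)
            writers[category].writerow(row)
    for f in handles:
        f.close()
```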

The problem I am facing is getting the category name corresponding to every link in my parse method, so that it can be saved in parent.csv and used to separate the different categories afterwards:

def parse_item(self, response):

    item = YellowItem()

    item['keyword'] = # here I need the corresponding category for every link

I think you should change the way you generate links. You can, for example, override the start_requests method and pass the category to each request through either its cb_kwargs or meta attribute. I would also suggest that you get the settings from the crawler running the spider by overriding from_crawler. Here's how I would do it:

from scrapy import Request
from scrapy.spiders import CrawlSpider


class YpSpider(CrawlSpider):
    def __init__(self, crawler, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.crawler = crawler
        self.categories = crawler.settings.get('CATEGORIES')

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def start_requests(self):
        for category in self.categories:
            yield Request(
                f'https://www.yellowpages.com/search?search_terms={category}&geo_location_terms=NY&page=1',
                self.parse,
                cb_kwargs={'category': category}
            )

    def parse(self, response, category):
        # do sth
        ...
        for url in response.xpath('//a[@class="business-name"]/@href').extract():
            # response.follow resolves relative hrefs against the current page
            yield response.follow(
                url,
                self.parse_item,
                cb_kwargs={'category': category}
            )
        next_page = response.xpath('//a[@class="next ajax-page"]/@href').extract_first()
        if next_page:  # there is no next link on the last results page
            yield response.follow(
                next_page,
                self.parse,
                cb_kwargs={'category': category}
            )

    def parse_item(self, response, category):
        item = YellowItem()
        item['keyword'] = category
        # do sth else
        ...
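Once each item carries its category, an alternative to splitting parent.csv after the fact is a Scrapy item pipeline that routes every item straight into its own per-category file as it is scraped. A sketch under assumptions: the class name `PerCategoryCsvPipeline` is made up, and it assumes every item has the `keyword` field set in parse_item:

```python
import csv

class PerCategoryCsvPipeline:
    """Write each item to '<keyword>.csv' instead of one parent.csv."""

    def open_spider(self, spider):
        self.files = {}    # keyword -> open file handle
        self.writers = {}  # keyword -> csv.DictWriter

    def process_item(self, item, spider):
        keyword = item['keyword']
        if keyword not in self.writers:
            # First item for this category: open its file and write a header
            f = open('{}.csv'.format(keyword), 'w', newline='')
            self.files[keyword] = f
            writer = csv.DictWriter(f, fieldnames=list(item.keys()))
            writer.writeheader()
            self.writers[keyword] = writer
        self.writers[keyword].writerow(dict(item))
        return item

    def close_spider(self, spider):
        for f in self.files.values():
            f.close()
```

It would be enabled in settings.py with something like `ITEM_PIPELINES = {'myproject.pipelines.PerCategoryCsvPipeline': 300}` (module path assumed).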
