
export scrapy to csv

I am trying to scrape 'healthunlocked.com', but I don't know why I cannot see the extracted data in the CSV file.

import scrapy
import csv

class HealthSpider(scrapy.Spider):
    name = 'health'
    #allowed_domains = ['https://healthunlocked.com/positivewellbeing/posts#popular']
    start_urls = ['https://healthunlocked.com/positivewellbeing/posts#popular']
    itemlist=[]

    def parse(self, response):
        
        all_div_posts = response.xpath('//div[@class="results-posts"]')
        
        for posts in all_div_posts:
            items={} 
            items['title']= posts.xpath('//h3[@class="results-post__title"]/text()').extract()
            items['post']= posts.xpath('//div[@class="results-post__body hidden-xs"]/text()').extract()
            self.itemlist.append(items)
          
           
        with open("outputfile.csv","w", newline="") as f:
            writer = csv.DictWriter(f,['title','post'])
            writer.writeheader()
            for data in self.itemlist:
                writer.writerow(data)

EDIT: I ran your code and it gives me a file with results.


Scrapy has a built-in function to save results in CSV, so you don't have to write it on your own.

You only have to yield items:

def parse(self, response):
    
    all_div_posts = response.xpath('//div[@class="results-posts"]')
    
    for posts in all_div_posts:
        items = {} 
        items['title']= posts.xpath('//h3[@class="results-post__title"]/text()').extract()
        items['post']= posts.xpath('//div[@class="results-post__body hidden-xs"]/text()').extract()

        yield items

and run with the option -o outputfile.csv:

scrapy runspider your_spider.py -o outputfile.csv

EDIT:

I made some changes, and now both versions give the same result - I checked it using the program diff to compare both csv files.

Because I organized the items in a different way, I could use writer.writerows(self.itemlist) directly, without a for loop (and zip()).

I also use .get() instead of extract() (or extract_first()) to get a single title and a single post to create each pair, and I can use strip() to clear surrounding spaces.
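A minimal, standard-library-only sketch of the change described above: once each item is a dict holding single strings (cleaned with strip()), csv.DictWriter.writerows() can write the whole list at once, replacing the per-row for loop. The sample item values here are hypothetical.

```python
import csv
import io

# hypothetical scraped values, as they might come back with stray spaces
itemlist = [
    {'title': '  First post  ', 'post': 'body one'},
    {'title': 'Second post', 'post': '  body two '},
]

# .strip() clears surrounding spaces, mirroring .get().strip() in the spider
cleaned = [{k: v.strip() for k, v in item.items()} for item in itemlist]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, ['title', 'post'])
writer.writeheader()
writer.writerows(cleaned)  # writes every row at once - no explicit loop

print(buffer.getvalue())
```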

Version 1

import scrapy
import csv

class HealthSpider(scrapy.Spider):
    name = 'health'
    #allowed_domains = ['https://healthunlocked.com/positivewellbeing/posts#popular']
    start_urls = ['https://healthunlocked.com/positivewellbeing/posts#popular']
    
    itemlist = []

    def parse(self, response):
        
        all_div_posts = response.xpath('//div[@class="results-post"]')
        print('len(all_div_posts):', len(all_div_posts))
        
        for one_post in all_div_posts:
            #print('>>>>>>>>>>>>')
            one_item = {
                'title': one_post.xpath('.//h3[@class="results-post__title"]/text()').get().strip(),
                'post': one_post.xpath('.//div[@class="results-post__body hidden-xs"]/text()').get().strip(),
            }
            self.itemlist.append(one_item)

            #yield one_item
          
                   
        with open("outputfile.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, ['title','post'])
            writer.writeheader()           
            writer.writerows(self.itemlist)

Version 2

import scrapy

class HealthSpider(scrapy.Spider):
    name = 'health'
    #allowed_domains = ['https://healthunlocked.com/positivewellbeing/posts#popular']
    start_urls = ['https://healthunlocked.com/positivewellbeing/posts#popular']
    
    #itemlist = []

    def parse(self, response):
        
        all_div_posts = response.xpath('//div[@class="results-post"]')
        print('len(all_div_posts):', len(all_div_posts))
        
        for one_post in all_div_posts:
            #print('>>>>>>>>>>>>')
            one_item = {
                'title': one_post.xpath('.//h3[@class="results-post__title"]/text()').get().strip(),
                'post': one_post.xpath('.//div[@class="results-post__body hidden-xs"]/text()').get().strip(),
            }
            #self.itemlist.append(one_item)

            yield one_item
          
                   
        #with open("outputfile.csv", "w", newline="") as f:
        #    writer = csv.DictWriter(f, ['title','post'])
        #    writer.writeheader()           
        #    writer.writerows(self.itemlist)

Try the following to get the exact results that you see on that webpage. The content is dynamic, so you need to fetch the JSON content in order to get the required results. I've used a customized approach to write the data to a csv file. If you go for the way below, the csv file is opened only once, and it is closed after the data are written to it.

import csv
import json
import scrapy

class HealthSpider(scrapy.Spider):
    name = "health"
    start_urls = ['https://solaris.healthunlocked.com/posts/positivewellbeing/popular']

    def __init__(self):
        self.outfile = open("output.csv","w",newline="",encoding="utf-8-sig")
        self.writer = csv.writer(self.outfile)
        self.writer.writerow(['title','post content'])

    def close(self,reason):
        self.outfile.close()

    def parse(self,response):
        for posts in json.loads(response.body_as_unicode()):
            title = ' '.join(posts['title'].split())
            post = ' '.join(posts['bodySnippet'].split())
            self.writer.writerow([title,post])
            yield {'title':title,'post':post}
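A standard-library sketch of the parse step above: json.loads() turns the response body into a list of post dicts, and ' '.join(s.split()) collapses newlines and runs of spaces in each field. The sample payload below is hypothetical; the real endpoint returns posts with 'title' and 'bodySnippet' keys, as the spider assumes.

```python
import json

# hypothetical response body standing in for response.body_as_unicode()
body = '[{"title": "Feeling  good\\n today", "bodySnippet": "Some   snippet\\ttext"}]'

rows = []
for posts in json.loads(body):
    # split() with no argument splits on any whitespace, so joining with a
    # single space normalizes newlines, tabs, and repeated spaces
    title = ' '.join(posts['title'].split())
    post = ' '.join(posts['bodySnippet'].split())
    rows.append([title, post])

print(rows)
```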
