
How to export scraped data using FEEDS / FEED EXPORTS in Scrapy

I'm new to web scraping, Scrapy, and Python.

Scrapy version: 2.5.1; OS: Windows; IDE: PyCharm.

I am trying to use the FEEDS option in Scrapy to automatically export the scraped data from a website to a CSV file that I can open in Excel.

I tried the following Stack Overflow solution, but it didn't work. I'm not sure what I'm doing wrong here; am I missing something?

The code on my machine:

I also tried adding the same FEEDS setting to my settings.py file (after commenting out custom_settings in my spider class), following the example in the documentation: https://docs.scrapy.org/en/latest/topics/feed-exports.html?highlight=feed#feeds
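What I added to settings.py looked roughly like this (the same FEEDS mapping, moved to the project settings):

# settings.py -- project-wide feed export, per the documentation linked above
FEEDS = {
    r"C:\Users\rreddy\PycharmProjects\fcdc\webscrp\outputfinal.csv": {"format": "csv", "overwrite": True},
}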

For now, I have met my requirement by using the spider_closed signal to write the data to a CSV file, storing all the scraped item data in a list called result:

import csv
import scrapy
from scrapy import signals
from urllib.parse import urljoin


class SpiderFC(scrapy.Spider):
    name = "FC"

    start_urls = [
        url,  # the site URL is defined elsewhere
    ]
    custom_setting = {"FEEDS": {r"C:\Users\rreddy\PycharmProjects\fcdc\webscrp\outputfinal.csv": {"format": "csv", "overwrite": True}}}

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(SpiderFC, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
        return spider

    def __init__(self, name=None):
        super().__init__(name)
        self.count = None

    def parse(self, response, **kwargs):
        # each item scraped from the parent page has a link where the actual
        # data needs to be scraped, so I follow each link and scrape it there
        yield response.follow(notice_href_follow, callback=self.parse_item,
                              meta={'item': item, 'index': index, 'next_page': next_page})

    def parse_item(self, response):
        # logic for the items to scrape goes here; they are saved to a temp
        # list, appended to the result list, and the temp list is cleared
        result.append(it)  # result is used at the end to write the CSV
        item.clear()
        if next_page:
            yield next(self.follow_next(response, next_page))

    def follow_next(self, response, next_page):
        next_page_url = urljoin(url, next_page[0])
        yield response.follow(next_page_url, callback=self.parse)

The spider_closed signal handler:

def spider_closed(self, spider):
    with open(output_path, mode="a", newline='') as f:
        writer = csv.writer(f)
        for v in result:
            writer.writerow([v["city"]])

When all the data is scraped and all requests have completed, the spider_closed signal handler writes the data to a CSV file. But I'm trying to avoid this manual logic and use Scrapy's built-in exporter instead, and that's where I'm having trouble exporting the data.
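Ideally, I'd like to drop the result list entirely and just yield items so the built-in exporter writes them, something like this (the city field name matches my spider_closed writer; the selector is only a placeholder):

def parse_item(self, response):
    # yield each row as an item instead of appending to result;
    # the FEEDS exporter serializes every yielded item
    yield {"city": response.css("span.city::text").get()}  # placeholder selector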

Check your path. If you are on Windows, provide the full path in custom_settings, e.g. as below:

custom_settings = {
    "FEEDS": {r"C:\Users\Name\Path\To\outputfinal.csv": {"format": "csv", "overwrite": True}},
}

If you are on Linux or macOS, provide the path as below:

custom_settings = {
    "FEEDS": {r"/Path/to/folder/fcdc/webscrp/outputfinal.csv": {"format": "csv", "overwrite": True}},
}

Alternatively, provide a relative path as below, which will create the folder structure fcdc/webscrp/outputfinal.csv in the directory the spider is run from.

custom_settings = {
    "FEEDS": {r"./fcdc/webscrp/outputfinal.csv": {"format": "csv", "overwrite": True}},
}
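Beyond the path, two details are worth double-checking in the spider above: the class attribute has to be spelled custom_settings (the question's snippet has custom_setting, which Scrapy silently ignores), and the built-in feed exporter only writes items that the spider yields, not rows collected in a list. Below is a minimal self-contained sketch combining both points; quotes.toscrape.com is only a public demo site used for illustration, so swap in your own URL and selectors.

import scrapy


class QuotesFeedDemo(scrapy.Spider):
    # illustrative spider against the public demo site quotes.toscrape.com
    name = "quotes_feed_demo"
    start_urls = ["https://quotes.toscrape.com/"]

    # note the plural attribute name: custom_settings
    custom_settings = {
        "FEEDS": {r"./fcdc/webscrp/outputfinal.csv": {"format": "csv", "overwrite": True}},
    }

    def parse(self, response, **kwargs):
        for quote in response.css("div.quote"):
            # every yielded dict becomes one CSV row; keys become columns
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }

Running it with scrapy crawl or scrapy runspider should produce the CSV at the configured path; the -O command-line option (e.g. scrapy crawl quotes_feed_demo -O outputfinal.csv, available since Scrapy 2.4) is another way to overwrite a feed without touching the settings.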
