简体   繁体   中英

How can i export scraped data to csv file in the right format?

I made an improvement to my code according to this suggestion from @paultrmbrth. what i need is to scrape data from pages that are similar to this and this one and i want the csv output to be like the picture below. 在此处输入图片说明

But my code's csv output is little messy, like this: 在此处输入图片说明

I have two questions, Is there anyway that the csv output can be like the first picture? and my second question is, i want the movie tittle to be scrapped too, Please give me a hint or provide to me a code that i can use to scrape the movie title and the contents.

UPDATE
The problem has been solved by Tarun Lalwani perfectly. But Now, the csv File's Header only contains the first scraped url categories. for example when i try to scrape this webpage which has References, Referenced in, Features, Featured in and Spoofed in categories and this webpage which has Follows, Followed by, Edited from, Edited into, Spin-off, References, Referenced in, Features, Featured in, Spoofs and Spoofed in categories then the csv output file header will only contain the first webpage's categories ie References, Referenced in, Features, Featured in and Spoofed in so some categories from the 2nd webpage like Follows, Followed by, Edited from, Edited into and Spoofs will not be on the output csv file header so is its contents.
Here is the code i used:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    allowed_domains = ["imdb.com"]
    start_urls = (
        'http://www.imdb.com/title/tt0093777/trivia?tab=mc&ref_=tt_trv_cnn',
        'http://www.imdb.com/title/tt0096874/trivia?tab=mc&ref_=tt_trv_cnn',
    )

    def parse(self, response):
        item = {}
        for cnt, h4 in enumerate(response.css('div.list > h4.li_group'), start=1):
            item['Title'] = response.css("h3[itemprop='name'] a::text").extract_first()
            key = h4.xpath('normalize-space()').get().strip()
            if key in ['Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
                       'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from',
                       'Features']:
                values = h4.xpath('following-sibling::div[count(preceding-sibling::h4)=$cnt]', cnt=cnt).xpath(
                    'string(.//a)').getall(),
                item[key] = values
        yield item

and here is exporters.py file:

try:
    from itertools import zip_longest as zip_longest
except:
    from itertools import izip_longest as zip_longest
from scrapy.exporters import CsvItemExporter
from scrapy.conf import settings


class NewLineRowCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(NewLineRowCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))

        values = [
            (val[0] if len(val) == 1 and type(val[0]) in (list, tuple) else val)
            if type(val) in (list, tuple)
            else (val, )
            for val in values]

        multi_row = zip_longest(*values, fillvalue='')

        for row in multi_row:
            self.csv_writer.writerow([unicode(s).encode("utf-8") for s in row])

What I'm trying to achieve is i want all these categories to be on the csv output header.

'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from', 'Features'   

Any help would be appreciated.

You can extract the title using below

item = {}
item['Title'] = response.css("h3[itemprop='name'] a::text").extract_first()

For the CSV part you would need to create a FeedExports which can split each row into multiple rows

from itertools import zip_longest
from scrapy.contrib.exporter import CsvItemExporter


class NewLineRowCsvItemExporter(CsvItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        super(NewLineRowCsvItemExporter, self).__init__(file, include_headers_line, join_multivalued, **kwargs)

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

        fields = self._get_serialized_fields(item, default_value='',
                                             include_empty=True)
        values = list(self._build_row(x for _, x in fields))

        values = [
            (val[0] if len(val) == 1 and type(val[0]) in (list, tuple) else val)
            if type(val) in (list, tuple)
            else (val, )
            for val in values]

        multi_row = zip_longest(*values, fillvalue='')

        for row in multi_row:
            self.csv_writer.writerow(row)

Then you need to assign the feed exporter in your settings

FEED_EXPORTERS = {
    'csv': '<yourproject>.exporters.NewLineRowCsvItemExporter',
}

Assuming you put the code in exporters.py file. The output will be as desired

导出数据

Edit-1

To set the fields and their order you will need to define FEED_EXPORT_FIELDS in your settings.py

FEED_EXPORT_FIELDS = ['Title', 'Follows', 'Followed by', 'Edited into', 'Spun-off from', 'Spin-off', 'Referenced in',
                       'Featured in', 'Spoofed in', 'References', 'Spoofs', 'Version of', 'Remade as', 'Edited from',
                       'Features']

https://doc.scrapy.org/en/latest/topics/feed-exports.html#std:setting-FEED_EXPORT_FIELDS

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM