
Storing in csv through Scrapy

I am learning web scraping with Scrapy and I am stuck on the following issue:

Why does the CSV file put all of the data in one row? It should have 8 rows and 4 columns. The columns are fine, but I can't understand why the data ends up in only one row.

import scrapy


class MyredditSpider(scrapy.Spider):
    name = 'myreddit'
    allowed_domains = ['reddit.com']
    start_urls = ['http://www.reddit.com/']
    # custom_settings = {
    #     "FEEDS": {"result.csv": {"format": "csv"}}
    # }

    def parse(self, response):
        all_var = response.xpath("//div[@class='rpBJOHq2PR60pnwJlUyP0']")

        for variable in all_var:
            post = variable.xpath("//h3[@class='_eYtD2XCVieq6emjKBH3m']/text()").extract()
            vote = variable.xpath("//div[@class='_1rZYMD_4xY3gRcSS3p8ODO _3a2ZHWaih05DgAOtvu6cIo ']/text()").extract()
            time = variable.xpath("//span[@class='_2VF2J19pUIMSLJFky-7PEI']/text()").extract()
            links = variable.xpath("//a[@data-click-id='body']/@href").extract()

            yield {"Posts": post, "Votes": vote, "Time": time, "Links": links}

I used scrapy crawl myreddit -o items.csv to save the data as CSV. I want a CSV where every value is in its own row accordingly, almost like in the image.

Your code looks fine and is working exactly as it is supposed to. Each yield produces a single row: whenever you yield an item, it is written to the output as one row. The example below would output two rows.

yield {"Posts": post}
yield {"Votes": vote, "Time": time, "Links": links}
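To see why each yielded dict maps to one CSV row, here is a minimal sketch using the standard csv module (the items and field names are illustrative, but this mirrors what Scrapy's CSV feed exporter does with yielded dicts):

```python
import csv
import io

# Each dict stands in for one yielded item; the CSV feed exporter
# writes each item as a separate row, much like DictWriter does.
items = [
    {"Posts": "title 1", "Votes": "120", "Time": "5h", "Links": "/r/a"},
    {"Posts": "title 2", "Votes": "98", "Time": "7h", "Links": "/r/b"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Posts", "Votes", "Time", "Links"])
writer.writeheader()
for item in items:
    writer.writerow(item)  # one row per item

print(buf.getvalue())
```

Two items yield two data rows plus the header line; one yield per row is exactly the behavior described above.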

Probably Scrapy opens the whole datasheet in one row because the delimiter arguments are not correct. The default delimiter is "," which separates the data by commas.

Keep in mind that if the scraped data contains commas, the splitting process will result in undesired columns.

If this is the case, you can use something like "|" or "\t" as the delimiter.
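The comma problem can be seen with the standard csv module (the sample row below is made up): a naive split on "," breaks a quoted field apart, while a proper CSV parser respects the quoting.

```python
import csv
import io

line = '"Hello, world",120\n'  # one quoted field that contains a comma

# Naive splitting cuts inside the quoted field -> 3 pieces.
naive = line.strip().split(",")

# csv.reader honors CSV quoting -> 2 fields, as intended.
parsed = next(csv.reader(io.StringIO(line)))

print(naive)
print(parsed)
```

This is why a delimiter that cannot appear in the data (or proper quoting) keeps the columns intact.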

On the other hand, the issue might be with the lineterminator argument, which indicates how each row of data should be terminated. The default value of this argument is "\r\n".

If your data does not use these characters as line terminators, you can use something else, like "\n", to make sure that the data is separated into different rows.

See example below:

from scrapy.exporters import CsvItemExporter

# Use "|" as the delimiter and "\n" as the lineterminator.
# CsvItemExporter writes to a binary file object, so open in "wb" mode.
with open("result.csv", "wb") as f:
    exporter = CsvItemExporter(f, delimiter="|", lineterminator="\n")
    exporter.start_exporting()
    exporter.export_item(item)  # item: the scraped item to write
    exporter.finish_exporting()

It is because of the way you are extracting the information...

Each of your extract() calls uses an absolute //... XPath, so it pulls every matching element on the whole page at once. If you want the items listed row by row, you need to iterate through the HTML elements row by row as well.

For example, it should look closer to this, where the spider iterates through each row, extracts the fields from that row using relative (.//) XPaths, yields the item, and then moves on to the next one.

import scrapy


class MyredditSpider(scrapy.Spider):
    name = 'myreddit'
    allowed_domains = ['reddit.com']
    start_urls = ['http://www.reddit.com/']

    def parse(self, response):
        for row in response.xpath('//div[@class="rpBJOHq2PR60pnwJlUyP0"]/div'):
            post = row.xpath(".//h3[@class='_eYtD2XCVieq6emjKBH3m']/text()").get()
            vote = row.xpath(".//div[@class='_1rZYMD_4xY3gRcSS3p8ODO _3a2ZHWaih05DgAOtvu6cIo ']/text()").get()
            time = row.xpath(".//span[@class='_2VF2J19pUIMSLJFky-7PEI']/text()").get()
            links = row.xpath(".//a[@data-click-id='body']/@href").get()
            yield {"Posts": post, "Votes": vote, "Time": time, "Links":links}
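As a side note, the custom_settings block commented out in the question is also a valid way to get the CSV written without the -o flag, via Scrapy's FEEDS setting (available since Scrapy 2.1). A minimal sketch of the setting itself:

```python
# FEEDS maps each output path to its export options (Scrapy 2.1+).
# Placed in a spider's custom_settings (or in settings.py), this
# writes every yielded item to result.csv as one CSV row.
custom_settings = {
    "FEEDS": {
        "result.csv": {"format": "csv"},
    },
}
```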
