
Python csv.DictReader - how to reverse output?

I'm trying to reverse the order in which a file is read. I am using DictReader because I want the contents in a dictionary. I'd like to read the first line in the file and use that for the keys, then parse the rest of the file in reverse (bottom to top), kind of like the Linux "tac" command. Is there an easy way to do this? Below is my code to read the file into a dictionary and write it to a file...

reader = csv.DictReader(open(file_to_parse, 'r'), delimiter=',', quotechar='"')
for line in reader:
    # ...

This code processes the file normally, but I need it to read the file from the end.

In other words, I'd like it to read the file:

fruit, vegetables, cars
orange, carrot, ford
apple, celery, chevy
grape, corn, chrysler

and be able to have it return:

{' cars': ' chrysler', ' vegetables': ' corn', 'fruit': 'grape'}
{' cars': ' chevy', ' vegetables': ' celery', 'fruit': 'apple'}
{' cars': ' ford', ' vegetables': ' carrot', 'fruit': 'orange'}

instead of:

{' cars': ' ford', ' vegetables': ' carrot', 'fruit': 'orange'}
{' cars': ' chevy', ' vegetables': ' celery', 'fruit': 'apple'}
{' cars': ' chrysler', ' vegetables': ' corn', 'fruit': 'grape'}

You'll have to read the whole CSV file into memory; you can do so by calling list() on the reader object:

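import csv

# Python 2: open in binary mode. On Python 3, use open(file_to_parse, newline='') instead.
with open(file_to_parse, 'rb') as inf:
    reader = csv.DictReader(inf, skipinitialspace=True)
    rows = list(reader)  # read every row into memory

for row in reversed(rows):
    print(row)  # rows arrive last-to-first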

Note that I used the file as a context manager here to ensure that the file is closed. You also want to open the file in binary mode (leave newline handling to the csv module); that applies to Python 2, and on Python 3 you would open it in text mode with newline='' instead. The rest of the configuration you passed to DictReader() matches the defaults, so I omitted it.

I set skipinitialspace to True, as judging from your sample input and output you do have spaces after your delimiters; the option removes these.

The csv.DictReader() object takes care of reading that first line as the keys.

Demo:

>>> import csv
>>> sample = '''\
... fruit, vegetables, cars
... orange, carrot, ford
... apple, celery, chevy
... grape, corn, chrysler
... '''.splitlines()
>>> reader = csv.DictReader(sample, skipinitialspace=True)
>>> rows = list(reader)
>>> for row in reversed(rows):
...     print row
... 
{'cars': 'chrysler', 'vegetables': 'corn', 'fruit': 'grape'}
{'cars': 'chevy', 'vegetables': 'celery', 'fruit': 'apple'}
{'cars': 'ford', 'vegetables': 'carrot', 'fruit': 'orange'}

Read the rows into a list and iterate over it in reverse:

lines = list(reader)
for line in lines[::-1]:
    print(line)

{' cars': ' chrysler', ' vegetables': ' corn', 'fruit': 'grape'}
{' cars': ' chevy', ' vegetables': ' celery', 'fruit': 'apple'}
{' cars': ' ford', ' vegetables': ' carrot', 'fruit': 'orange'}

Or as Martijn Pieters suggested:

for line in reversed(list(reader)):
    print(line)

You don't actually have to read the whole file into memory.

A csv.DictReader doesn't actually require a file, just an iterable of strings.*

And you can read a text file in reverse order in average linear time with constant space with not too much overhead. It's not trivial, but it's not that hard:

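import io

def reverse_lines(path, encoding='utf-8', bufsize=1024):
    # Binary mode is required: text-mode files reject nonzero relative seeks.
    # Splitting on b'\n' is safe for UTF-8 and other ASCII-compatible encodings.
    with open(path, 'rb') as f:
        buf = b''
        f.seek(0, io.SEEK_END)
        pos = f.tell()
        while pos:
            # Step back one chunk, or to the start of the file if less remains.
            size = min(bufsize, pos)
            pos -= size
            f.seek(pos, io.SEEK_SET)
            buf = f.read(size) + buf
            lines = buf.split(b'\n')
            # The first piece may be an incomplete line; keep it for the next chunk.
            buf = lines.pop(0)
            yield from (line.decode(encoding) for line in reversed(lines))
        yield buf.decode(encoding)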

This isn't rigorously tested. It strips off the newlines (which is fine for csv.DictReader, but not fine in general), and it's not optimized for unusual but possible edge cases (e.g., for really long lines, it will be quadratic). It also requires Python 3.3, and the file doesn't go away until you close or release the iterator (it probably should be a context manager so you can deal with that). But if you really want this, I'm willing to bet you can find a recipe on ActiveState or a distribution on PyPI with none of those problems.
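One possible way to deal with the cleanup concern without rewriting the generator is contextlib.closing, which calls the generator's close() method on exit. The following is only a sketch: lines_from is a hypothetical stand-in for reverse_lines (any generator that opens a file inside its own with block behaves the same way), and the temporary file exists just to make the example runnable.

```python
import contextlib
import os
import tempfile

def lines_from(path):
    """Hypothetical stand-in for reverse_lines: a generator that owns an open file."""
    with open(path) as f:
        for line in f:
            yield line.rstrip('\n')

# Build a small throwaway file to read from.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False) as tmp:
    tmp.write('fruit\norange\napple\n')

with contextlib.closing(lines_from(tmp.name)) as lines:
    first = next(lines)  # stop early, leaving the generator suspended

# Leaving the block called lines.close(), which raises GeneratorExit inside
# the generator so that its own `with open(...)` closes the file.
os.remove(tmp.name)
```

Without the closing() wrapper, an abandoned generator keeps its file open until the generator object is garbage-collected.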

Anyway, for a medium-sized file, I suspect it'd actually be faster, on almost any real-life filesystem, to read the whole thing into memory in forward order then iterate the list in reverse. But for a very large file (especially one you can't even fit into memory), this solution is obviously much better.

From a quick test (see http://pastebin.com/Nst6WFwV for code), on my computer, the basic breakdown is:

  • Much slower for files <<1000 lines.
  • About 10% slower from 1K-1M lines.
  • Crossover around 30M lines.
  • 50% faster at 500M lines.
  • 1300% faster at 1.5G lines.
  • Effectively infinitely faster at 2.5G lines (the list-reversing version throws my machine into swap hell, and I have to ssh in to kill the process and wait a couple minutes for it to recover…).

Of course the details will depend on a lot of facts about your computer. It's probably no coincidence that 500M 72-char lines of ASCII take up close to half the physical RAM on my machine. But with a hard drive instead of an SSD you'd probably see more penalty for reverse_lines (since random seeks would be much slower compared to contiguous reads, and disk in general would be more important). And your platform's malloc and VM behavior, and even locality issues (parsing a line almost immediately after reading it instead of after it's been swapped out and back in…) might make a difference. And so on.

Anyway, the lesson is, if you're not expecting at least tens of millions of lines (or maybe a bit less on a very resource-constrained machine), don't even think about this; just keep it simple.


* As Martijn Pieters points out in the comments, if you're not using explicit fieldnames, DictReader requires an iterable of strings where the first line is the header. But you can fix that by reading the first line separately with a csv.reader and passing it as the fieldnames, or even by itertools.chain-ing the first line from a forward read in front of all but the last line of the backward read.
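
The chaining variant from the footnote might look roughly like this. It is a sketch under assumptions: drop_last and reversed_dict_rows are hypothetical helper names, and the in-memory list stands in for the lines that reverse_lines would produce from a real file.

```python
import csv
import itertools

def drop_last(iterable):
    """Yield every item except the final one, holding back one item at a time."""
    it = iter(iterable)
    try:
        prev = next(it)
    except StopIteration:
        return
    for item in it:
        yield prev
        prev = item

def reversed_dict_rows(header_line, lines_bottom_up, **fmtparams):
    """Build a DictReader over a bottom-to-top stream of lines.

    header_line is the file's first line (from a normal forward read).
    lines_bottom_up yields all lines of the file, last line first, so the
    header shows up again as its final item and has to be dropped.
    """
    data = drop_last(lines_bottom_up)
    return csv.DictReader(itertools.chain([header_line], data), **fmtparams)

# In-memory stand-in for a file's lines.
sample = [
    'fruit, vegetables, cars',
    'orange, carrot, ford',
    'apple, celery, chevy',
    'grape, corn, chrysler',
]
rows = list(reversed_dict_rows(sample[0], reversed(sample), skipinitialspace=True))
for row in rows:
    print(row['fruit'])  # grape, then apple, then orange
```

Because drop_last only ever holds one line back, the whole pipeline stays lazy; nothing here requires the data rows to fit in memory at once.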
