Python csv.DictReader - how to reverse output?

I'm trying to reverse the order in which a file is read. I am using DictReader because I want the contents in a dictionary. I'd like to read the first line of the file and use it for the keys, then parse the rest of the file in reverse (bottom to top), much like the Linux "tac" command. Is there an easy way to do this? Below is my code to read the file into a dictionary and write it to a file...

reader = csv.DictReader(open(file_to_parse, 'r'), delimiter=',', quotechar='"')
for line in reader:
    # ...

This code processes the file normally; however, I need it to read the file from the end.

In other words, I'd like it to read the file:

fruit, vegetables, cars
orange, carrot, ford
apple, celery, chevy
grape, corn, chrysler

and return:

{' cars': ' chrysler', ' vegetables': ' corn', 'fruit': 'grape'}
{' cars': ' chevy', ' vegetables': ' celery', 'fruit': 'apple'}
{' cars': ' ford', ' vegetables': ' carrot', 'fruit': 'orange'}

instead of:

{' cars': ' ford', ' vegetables': ' carrot', 'fruit': 'orange'}
{' cars': ' chevy', ' vegetables': ' celery', 'fruit': 'apple'}
{' cars': ' chrysler', ' vegetables': ' corn', 'fruit': 'grape'}

You'll have to read the whole CSV file into memory; you can do so by calling list() on the reader object:

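import csv

# Python 2: open in binary mode. On Python 3, use open(file_to_parse, newline='') instead.
with open(file_to_parse, 'rb') as inf:
    reader = csv.DictReader(inf, skipinitialspace=True)
    rows = list(reader)  # read every data row into memory

for row in reversed(rows):
    print(row)  # rows come out last-to-first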

Note that I used the file as a context manager here to ensure that the file is closed. On Python 2 you also want to open the file in binary mode (leave newline handling to the csv module); on Python 3, open it in text mode with newline='' instead. The rest of the configuration you passed to DictReader() matches the defaults, so I omitted it.

I set skipinitialspace to True because, judging from your sample input and output, you have spaces after your delimiters; the option removes these.
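To see what skipinitialspace changes, here is a minimal sketch that parses two lines of the sample data both ways. The variable names are only illustrative.

```python
import csv

sample = [
    "fruit, vegetables, cars",
    "orange, carrot, ford",
]

# Default: the space after each comma stays in the keys and values.
plain = list(csv.DictReader(sample))

# skipinitialspace=True drops whitespace that directly follows a delimiter.
stripped = list(csv.DictReader(sample, skipinitialspace=True))

print(sorted(plain[0].keys()))     # [' cars', ' vegetables', 'fruit']
print(sorted(stripped[0].keys()))  # ['cars', 'fruit', 'vegetables']
```

Without the option you would have to look up row[' cars'], leading space included, which is easy to get wrong.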

The csv.DictReader() object takes care of reading that first line as the keys.

Demo:

>>> import csv
>>> sample = '''\
... fruit, vegetables, cars
... orange, carrot, ford
... apple, celery, chevy
... grape, corn, chrysler
... '''.splitlines()
>>> reader = csv.DictReader(sample, skipinitialspace=True)
>>> rows = list(reader)
>>> for row in reversed(rows):
...     print row
... 
{'cars': 'chrysler', 'vegetables': 'corn', 'fruit': 'grape'}
{'cars': 'chevy', 'vegetables': 'celery', 'fruit': 'apple'}
{'cars': 'ford', 'vegetables': 'carrot', 'fruit': 'orange'}
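
On Python 3 the same approach could be wrapped in a small function, sketched below. The name read_rows_reversed is made up for this example, and the demo writes the question's sample data to a throwaway temporary file so that the block runs on its own.

```python
import csv
import os
import tempfile

def read_rows_reversed(path):
    """Return the data rows of a CSV file as dicts, last row first."""
    # On Python 3 the csv docs recommend text mode with newline=''.
    with open(path, newline='') as inf:
        reader = csv.DictReader(inf, skipinitialspace=True)
        rows = list(reader)  # the header line becomes the dict keys
    rows.reverse()           # reverse in place, no second copy of the list
    return rows

# Demo: write the sample data from the question to a temporary file.
with tempfile.NamedTemporaryFile('w', suffix='.csv', delete=False,
                                 newline='') as tmp:
    tmp.write("fruit, vegetables, cars\n"
              "orange, carrot, ford\n"
              "apple, celery, chevy\n"
              "grape, corn, chrysler\n")

for row in read_rows_reversed(tmp.name):
    print(row['fruit'], row['vegetables'], row['cars'])

os.remove(tmp.name)
```

This still holds every row in memory at once, exactly like the snippet above; it only moves the pattern into a reusable helper.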

Read the rows into a list and reverse it:

lines = [x for x in reader]
for line in lines[::-1]:
    print(line)

{' cars': ' chrysler', ' vegetables': ' corn', 'fruit': 'grape'}
{' cars': ' chevy', ' vegetables': ' celery', 'fruit': 'apple'}
{' cars': ' ford', ' vegetables': ' carrot', 'fruit': 'orange'}

Or as Martijn Pieters suggested:

for line in reversed(list(reader)):
    print(line)

You don't actually have to read the whole file into memory.

A csv.DictReader doesn't actually require a file, just an iterable of strings.*

And you can read a text file in reverse order in average linear time and constant space, without too much overhead. It's not trivial, but it's not that hard:

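import io

def reverse_lines(path, encoding='utf-8', chunk_size=1024):
    # Binary mode: Python 3 text files do not support nonzero relative
    # seeks, so read raw bytes and decode each complete line instead.
    with open(path, 'rb') as f:
        buf = b''
        f.seek(0, io.SEEK_END)
        pos = f.tell()
        while pos:
            # Step back one chunk (or to the start of the file) and read it.
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            buf = f.read(step) + buf
            lines = buf.split(b'\n')
            # The first piece may be an incomplete line; keep it for later.
            buf = lines.pop(0)
            yield from (line.decode(encoding) for line in reversed(lines))
        yield buf.decode(encoding)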

This isn't rigorously tested, and it has several limitations. It strips off the newlines (which is fine for csv.DictReader, but not fine in general). It isn't optimized for unusual but possible edge cases (e.g., for really long lines, it will be quadratic). It requires Python 3.3 or later. And the file isn't closed until you close or release the iterator (it probably should be a context manager so you can deal with that). But if you really want this, I'm willing to bet you can find a recipe on ActiveState or a distribution on PyPI with none of those problems.
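
One possible way to deal with the "file stays open until the iterator is released" caveat, short of rewriting the function as a context manager, is contextlib.closing. It calls the generator's close() method on exit, which unwinds the generator's own with block. The sketch below uses a hypothetical stand-in generator, lines_of, instead of reverse_lines to stay short; the opened list exists only so the demo can inspect the file object afterwards.

```python
import contextlib
import os
import tempfile

opened = []  # demo only: lets us check the file object afterwards

def lines_of(path):
    """Hypothetical stand-in for reverse_lines: holds a file open while suspended."""
    with open(path) as f:
        opened.append(f)
        for line in f:
            yield line.rstrip('\n')

with tempfile.NamedTemporaryFile('w', delete=False) as tmp:
    tmp.write("one\ntwo\nthree\n")

# closing() calls the generator's close() on exit, which raises GeneratorExit
# at the suspended yield and so leaves the generator's `with open(...)` block.
with contextlib.closing(lines_of(tmp.name)) as lines:
    first = next(lines)  # stop after one line; the generator is only suspended

print(first, opened[0].closed)  # one True
os.remove(tmp.name)
```

The same wrapper should work for any generator that opens a file internally, since close() is part of the generator protocol.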

Anyway, for a medium-sized file, I suspect it'd actually be faster, on almost any real-life filesystem, to read the whole thing into memory in forward order and then iterate the list in reverse. But for a very large file (especially one you can't even fit into memory), this solution is obviously much better.

From a quick test (see http://pastebin.com/Nst6WFwV for code), on my computer, the basic breakdown for reverse_lines compared with reversing a list is:

  • Much slower for files of far fewer than 1,000 lines.
  • About 10% slower from 1K to 1M lines.
  • Crossover at around 30M lines.
  • 50% faster at 500M lines.
  • 1300% faster at 1.5G lines.
  • Effectively infinitely faster at 2.5G lines (the list-reversing version throws my machine into swap hell, and I have to ssh in to kill the process and wait a couple of minutes for it to recover…).

Of course the details will depend on a lot of facts about your computer. It's probably no coincidence that 500M 72-char lines of ASCII take up close to half the physical RAM on my machine. But with a hard drive instead of an SSD you'd probably see more of a penalty for reverse_lines (since random seeks would be much slower compared to contiguous reads, and disk in general would be more important). And your platform's malloc and VM behavior, and even locality issues (parsing a line almost immediately after reading it instead of after it's been swapped out and back in…) might make a difference. And so on.

Anyway, the lesson is, if you're not expecting at least tens of millions of lines (or maybe a bit less on a very resource-constrained machine), don't even think about this; just keep it simple.


* As Martijn Pieters points out in the comments, if you're not using explicit fieldnames, DictReader requires an iterable of strings where the first line is the header. But you can fix that by reading the first line separately with a csv.reader and passing it as the fieldnames, or even by using itertools.chain to put the first line from a forward read before all but the last line of the backward read.
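
The first fix from the footnote might look something like the sketch below: read the header with a forward csv.reader, then feed DictReader the backward lines minus the final one (which is the header again). The helper names all_but_last and reversed_dict_rows are invented for this example, and a compact byte-based reverse_lines is included so that the block runs on its own. It assumes no quoted field contains an embedded newline, since such a field would be reassembled in the wrong order when read backward.

```python
import csv
import io

def reverse_lines(path, encoding='utf-8', chunk_size=1024):
    """Yield the lines of a file from last to first, without their newlines."""
    with open(path, 'rb') as f:
        f.seek(0, io.SEEK_END)
        pos = f.tell()
        buf = b''
        while pos:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            buf = f.read(step) + buf
            lines = buf.split(b'\n')
            buf = lines.pop(0)  # possibly incomplete; keep for the next round
            for line in reversed(lines):
                yield line.decode(encoding)
        yield buf.decode(encoding)

def all_but_last(iterable):
    """Yield every item except the final one, lazily (one item of lookahead)."""
    it = iter(iterable)
    try:
        prev = next(it)
    except StopIteration:
        return
    for item in it:
        yield prev
        prev = item

def reversed_dict_rows(path, encoding='utf-8', chunk_size=1024, **fmtparams):
    """Return a DictReader over the data rows of a CSV file, bottom to top."""
    # Forward read: only the header line, parsed with the same csv options.
    with open(path, newline='', encoding=encoding) as f:
        fieldnames = next(csv.reader(f, **fmtparams))
    # Backward read: the last line yielded is the header, so drop it.
    body = all_but_last(reverse_lines(path, encoding, chunk_size))
    return csv.DictReader(body, fieldnames=fieldnames, **fmtparams)
```

A call such as reversed_dict_rows(file_to_parse, skipinitialspace=True) would then be iterated like any other DictReader. If the file ends with a newline, the backward read starts with an empty string, which DictReader skips as a blank row.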
