
Python - get column iterator from a file (without reading the whole file)

My main goal is to calculate the median (by column) of a HUGE matrix of floats. Example:

a = numpy.array(([1,1,3,2,7],[4,5,8,2,3],[1,6,9,3,2]))

numpy.median(a, axis=0)

Out[38]: array([ 1.,  5.,  8.,  2.,  3.])

The matrix is too big to fit in memory (~5 terabytes), so I keep it in a CSV file. I want to run over each column and calculate its median.

Is there any way for me to get a column iterator without reading the whole file?

Any other ideas about calculating the median for the matrix would be good too. Thank you!

If you can fit each column into memory (which you seem to imply you can), then this should work:

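import csv

def columns(file_name):
    # First pass: read a single row just to count the columns.
    with open(file_name, newline='') as file:
        column_count = len(next(csv.reader(file)))
    # One full pass over the file per column.
    for column in range(column_count):
        with open(file_name, newline='') as file:
            # Fields come back as strings; convert with float() before use.
            yield [row[column] for row in csv.reader(file)]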

This works by finding out how many columns we have, then looping over the file once per column, taking the current column's item out of each row. At most, we hold one column plus one row in memory at a time. It's a pretty simple generator. Note that we have to keep reopening the file, because looping through the reader exhausts it. The trade-off is that the whole file is read once for every column.
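To get from the generator to actual medians, the string fields still need converting to floats. The following is a minimal sketch, assuming the file has no header row and every field parses as a float. `column_medians` is a name made up for this illustration, not part of the answer above.

```python
import csv
import statistics

def columns(file_name):
    """Yield each column of a CSV file as a list of strings."""
    with open(file_name, newline='') as file:
        column_count = len(next(csv.reader(file)))
    for column in range(column_count):
        with open(file_name, newline='') as file:
            yield [row[column] for row in csv.reader(file)]

def column_medians(file_name):
    """Hypothetical helper: yield the median of each column in turn."""
    for column in columns(file_name):
        # Only one column is held in memory at any moment.
        yield statistics.median(float(value) for value in column)
```

Because `column_medians` is itself a generator, results can be consumed one column at a time, for example with `for median in column_medians('matrix.csv'): print(median)`, where `matrix.csv` is a placeholder file name.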

I would do this by initializing N empty files, one for each column. Then read the matrix one row at a time and send each column entry to the correct file. Once you've processed the whole matrix, go back and calculate the median of each file sequentially.

This basically uses the filesystem to do a matrix transpose. Once transposed, calculating the median of each row (that is, each original column) is easy.
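The transpose-through-the-filesystem idea could be sketched roughly as follows. The function names and the `col_<i>.txt` naming scheme are assumptions made for this example. It also assumes there is no header row, that one column fits in memory for the median step, and that the operating system allows N files to be open at once (a very wide matrix would need to be processed in batches of columns).

```python
import csv
import itertools
import os
import statistics

def split_columns(csv_path, out_dir):
    """Write column i of csv_path to out_dir/col_i.txt, one value per line."""
    os.makedirs(out_dir, exist_ok=True)
    with open(csv_path, newline='') as source:
        reader = csv.reader(source)
        first_row = next(reader)
        paths = [os.path.join(out_dir, f"col_{i}.txt")
                 for i in range(len(first_row))]
        outputs = [open(path, "w") for path in paths]
        try:
            # Single pass over the matrix: route each field to its column file.
            for row in itertools.chain([first_row], reader):
                for out, value in zip(outputs, row):
                    out.write(value + "\n")
        finally:
            for out in outputs:
                out.close()
    return paths

def file_median(path):
    """Median of a one-value-per-line file (loads that one column)."""
    with open(path) as column_file:
        return statistics.median(float(line) for line in column_file)
```

After `paths = split_columns('matrix.csv', 'columns')`, the medians are `[file_median(p) for p in paths]`. The matrix is read only once, at the cost of writing a second copy of the data to disk.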

There's probably no direct way to do what you're asking with a CSV file (unless I've misunderstood you). The problem is that there's no meaningful sense in which any file has "columns" unless the file is specially designed to have fixed-width rows. CSV files aren't generally designed that way. On disk, they're nothing more than a giant string:

>>> import csv
>>> with open('foo.csv', 'w', newline='') as f:
...     writer = csv.writer(f)
...     for i in range(0, 100, 10):
...         writer.writerow(range(i, i + 10))
... 
>>> with open('foo.csv', 'r', newline='') as f:
...     f.read()
... 
'0,1,2,3,4,5,6,7,8,9\r\n10,11,12,13,14,15,16,17,18,19\r\n20..(output truncated)..

As you can see, the column fields don't line up predictably; the second column starts at index 2 in the first row, but in the next row the fields are one character wider, throwing off the alignment. This is even worse when input lengths vary. The upshot is that the csv reader will have to read the entire file, throwing out the data you don't use. (If you don't mind that, then that's the answer: read the whole file line by line, throwing out the data you won't use.)

If you don't mind wasting some space and know that none of your data will be longer than some fixed width, you could create a file with fixed-width fields, and then you could seek through it using offsets. But once you're doing that, you might as well start using a real database. PyTables seems to be the favorite choice of many for storing numpy arrays.
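As an illustration of the fixed-width idea, here is a minimal sketch that stores every value as an 8-byte binary double, so the position of any field can be computed directly. Using binary doubles rather than padded text is a choice made for this example, and `write_fixed_width` and `read_column` are hypothetical names.

```python
import struct

FIELD = struct.Struct("<d")  # every field is exactly 8 bytes

def write_fixed_width(path, rows):
    """Store rows of floats as fixed-width binary fields, row after row."""
    with open(path, "wb") as out:
        for row in rows:
            for value in row:
                out.write(FIELD.pack(value))

def read_column(path, column, column_count):
    """Yield one column by seeking straight to each of its fields."""
    row_size = FIELD.size * column_count
    offset = FIELD.size * column
    with open(path, "rb") as source:
        while True:
            source.seek(offset)
            chunk = source.read(FIELD.size)
            if len(chunk) < FIELD.size:
                break  # ran past the last row
            yield FIELD.unpack(chunk)[0]
            offset += row_size
```

Only the requested fields are decoded, although the operating system still fetches whole disk blocks around each one, so for narrow rows the savings over a sequential scan may be small.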

You can use bucket sort to sort each of the columns on disk without reading them all into memory. Then you can simply pick the middle value.
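Since only the middle value is needed, one possible simplification is to bucket the values without fully sorting them: count how many values fall into each bucket, then load and sort only the bucket that contains the median. The sketch below takes that approach, which is a swapped-in variant rather than a full on-disk bucket sort. It assumes the column contains no NaNs and that `read_values` (a hypothetical callable) returns a fresh iterator over the column's floats on every call.

```python
def bucket_median(read_values, bucket_count=1000):
    """Median of a column that is streamed from disk several times."""
    # Pass 1: find the value range and the number of items.
    count = 0
    low = high = None
    for value in read_values():
        low = value if low is None else min(low, value)
        high = value if high is None else max(high, value)
        count += 1
    if count == 0:
        raise ValueError("empty column")
    if low == high:
        return low
    span = high - low

    def bucket_of(value):
        # Clamp so that value == high lands in the last bucket.
        return min(int((value - low) / span * bucket_count), bucket_count - 1)

    # Pass 2: count how many values fall into each bucket.
    counts = [0] * bucket_count
    for value in read_values():
        counts[bucket_of(value)] += 1

    def kth_smallest(k):
        # Walk the histogram to the bucket containing rank k (0-based).
        below = 0
        for index, size in enumerate(counts):
            if below + size > k:
                break
            below += size
        # Extra pass: load and sort only that one bucket.
        members = sorted(v for v in read_values() if bucket_of(v) == index)
        return members[k - below]

    middle = count // 2
    if count % 2:
        return kth_smallest(middle)
    return (kth_smallest(middle - 1) + kth_smallest(middle)) / 2
```

With heavily skewed data a single bucket can still hold most of the values; in that case the same procedure could be repeated on that bucket's range with finer buckets.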

Or you can use the UNIX awk and sort commands to split and then sort your columns before you select the median.
