How do I transpose/pivot a csv file with python *without* loading the whole file into memory?

For one of my data analysis pipelines, I end up generating a lot of individual CSV files. I would like to transpose them, concatenate them, and transpose them again. However, the amount of data is large, so loading it all into memory is not practical.

Concatenating the rows of data from two csv files (if that's what you meant) without loading all of both of them into memory is a relatively easy and fast operation: just read in a single row from each one, join those together, and then write that to an output file, repeating until all the input data is exhausted.
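
A minimal sketch of that row-by-row approach (assuming Python 3 and two hypothetical input files, part1.csv and part2.csv) might look like this; note that zip() stops as soon as the shorter of the two files is exhausted:

import csv

# Hypothetical file names, purely for illustration.
with open('part1.csv', newline='') as f1, \
        open('part2.csv', newline='') as f2, \
        open('combined.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    # zip() is lazy in Python 3, so only one row from each input
    # is held in memory at a time.
    for row1, row2 in zip(csv.reader(f1), csv.reader(f2)):
        writer.writerow(row1 + row2)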

Transposing the data in a csv file without reading the entire thing into memory is intrinsically going to be a much slower process, since it requires the entire input file to be reread in multiple passes, each time extracting the data from just one of the columns it contains. If that's an acceptable (or necessary) trade-off, here's basically how it would be done using the built-in csv module:

import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

with open(output_filename, 'wb') as outputf:
    writer = csv.writer(outputf)
    with open(input_filename, 'rb') as inputf:
        # determine number of columns in input file by counting those in its first row
        # number of cols in input file determines number of rows in output file
        numcols = len(csv.reader(inputf).next())
        # read entire input file multiple times, extracting one column from each row
        for col_index in xrange(numcols):
            # write all of column data as a single row of the output file
            inputf.seek(0)  # rewind file for each pass
            writer.writerow(tuple(row[col_index] for row in csv.reader(inputf)))
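
The snippet above is written for Python 2 (the 'rb'/'wb' file modes, .next() and xrange). A minimal sketch of the same multi-pass idea for Python 3 might look like this:

import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

with open(output_filename, 'w', newline='') as outputf:
    writer = csv.writer(outputf)
    with open(input_filename, newline='') as inputf:
        # number of columns in the input determines the number of rows in the output
        numcols = len(next(csv.reader(inputf)))
        for col_index in range(numcols):
            inputf.seek(0)  # rewind and re-read the whole input for each output row
            writer.writerow(tuple(row[col_index] for row in csv.reader(inputf)))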

Here is a solution that works when the fields have fixed widths:

import sys
import os


def main():

    path_in = sys.argv[-1]
    path_out = os.path.basename(path_in)+'.transposed'

    with open(path_in) as fd_in:
        # Infer the layout from the first line: every field, together with its
        # trailing separator (or newline), is assumed to occupy the same number
        # of characters on every line of the file.
        line = fd_in.readline()
        fields = line.split()
        field_width = int(len(line)/len(fields))

    file_size = os.path.getsize(path_in)
    # With fixed-width lines, the number of lines is simply file size / line length.
    cols2 = rows1 = line_count = int(file_size/len(line))
    rows2 = cols1 = len(fields)

    with open(path_in) as fd_in, open(path_out, 'w') as fd_out:
        for row in range(rows2):
            # Build one output row by seeking straight to the same column of
            # every input line; only one field is ever held in memory.
            for col in range(cols2-1):
                fd_in.seek(col*len(line)+row*field_width)
                fd_out.write('{} '.format(fd_in.read(field_width-1)))
            # The field from the last input line ends the output row with a newline.
            fd_in.seek((col+1)*len(line)+row*field_width)
            fd_out.write('{}\n'.format(fd_in.read(field_width-1)))

    return


if __name__ == '__main__':
    main()
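
If this script were saved as, say, transpose_fixed.py (the file name is just an assumption), it would be invoked as python transpose_fixed.py data.txt; since the input path is taken from the last command-line argument, it writes data.txt.transposed into the current working directory.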

Here is a solution that works if the fields don't have fixed widths:

import sys
import os


def main():

    path_in = sys.argv[-1]
    path_out = os.path.basename(path_in)+'.transposed'
    separator = ' '

    # First pass: record the byte offset where each input line starts.
    d_seek = {}
    with open(path_in) as fd_in:
        i = 0
        while True:
            tell = fd_in.tell()
            if fd_in.readline() == '':
                break
            d_seek[i] = tell
            i += 1
    cols2 = rows1 = i

    with open(path_in) as fd_in:
        line = fd_in.readline()
    rows2 = cols1 = len(line.split(separator))
    del line

    with open(path_in) as fd_in, open(path_out, 'w') as fd_out:
        for row2 in range(rows2):
            for row1 in range(rows1):
                # Jump to the next unread field of input line row1 and read it
                # character by character until the separator (or newline/EOF) is hit.
                fd_in.seek(d_seek[row1])
                s = ''
                while True:
                    char = fd_in.read(1)
                    if char == separator or char == '\n' or char == '':
                        break
                    s += char
                # Advance this line's saved offset past the field and its separator.
                d_seek[row1] += len(s)+1
                if row1+1 < rows1:
                    fd_out.write('{} '.format(s))
                else:
                    fd_out.write('{}\n'.format(s))

    return


if __name__ == '__main__':
    main()
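
Note the trade-off in this version: d_seek keeps one byte offset per input line, so memory use grows with the number of rows (though not with their length), and every field written to the output costs an extra seek back into the input file.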

Another short and pythonic solution. I used this to transpose CSVs that are 15,000,000 x 12,000. It's fast and pure Python. Everything else you need to do is trivial, and this is definitely the hardest part.

Github link: https://gist.github.com/arose13/facfb91b609d453f3ad840417faaa503a

def transpose_csv_out_of_core(csv_path, output_csv_path='transposed.csv', delimiter=','):
    """
    On my laptop it can transpose at ~375,000 lines a sec

    :param csv_path: path of the input CSV to transpose
    :param output_csv_path: path the transposed CSV is written to
    :param delimiter: field delimiter used when writing the output
    :return: None
    """
    import csv

    transposed_iterator = zip(*csv.reader(open(csv_path)))
    with open(output_csv_path, 'w') as out:
        for row in transposed_iterator:
            out.write(delimiter.join(row) + '\n')
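
A usage sketch, with purely hypothetical file names:

transpose_csv_out_of_core('wide_matrix.csv', output_csv_path='wide_matrix_T.csv', delimiter=',')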

The following code simulates reading from two csv files. The first one has the two rows

[1,2,1]
[3,4,1]

and the second one

[7,8,2]
[9,10,2]

The result is the two rows

[1,2,1,7,8,2]
[3,4,1,9,10,2]

Is that what you wanted?

def source1():
    for i in [ [1,2, 1] ,[3,4, 1]] : yield i

def source2():
    for i in [ [7,8,2] ,[9,10,2]] : yield i

def join(*sources):
    # Pull one row from each source per iteration; when any source is exhausted,
    # its .next() raises StopIteration, which ends this generator (Python 2 behaviour).
    while True:
        row = []
        for s in sources:
            row.extend(s.next())
        yield row

for row in join(source1(), source2()):
    print row

In your case you have to replace the calls to source1() and source2() with the csv file iterators.
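
For instance, a minimal sketch reusing the join() generator above with two hypothetical files (Python 2, since the generator relies on the readers' .next() method):

import csv

# Hypothetical input files, read one row at a time.
reader1 = csv.reader(open('first.csv', 'rb'))
reader2 = csv.reader(open('second.csv', 'rb'))

with open('combined.csv', 'wb') as out:
    writer = csv.writer(out)
    for row in join(reader1, reader2):
        writer.writerow(row)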

Use a generator, e.g.:

from itertools import izip

file1 = open("test", "r")
file2 = open("test2", "r")

def lazy(file):
    for line in file:
        #do something with the line
        yield line

for lines in izip(lazy(file1), lazy(file2)):
    print lines

http://wiki.python.org/moin/Generators

Edit: You can use the csv module to parse it. Also, I realized that the readlines() method of file objects isn't lazy, so you have to use the for line in file pattern.
