How can I transpose/pivot a CSV file with Python without loading the whole file into memory?

Question

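For one of my data analysis pipelines, I end up generating a lot of individual CSV files. I would like to transpose them, concatenate them, and then transpose the result again. However, the amount of data is large, so loading it all into memory is not practical.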

Answer 1

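Concatenating the rows of data from two CSV files (if that is what you meant) without loading all of both of them into memory is a relatively easy and fast operation: just read a single row from each one, join the two together, write the result to an output file, and repeat until all the input data is exhausted.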
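A minimal sketch of that row-by-row join, assuming both inputs have the same number of rows. The function name `join_csv_rows` and its parameters are placeholders for illustration, not part of the original answer.

```python
import csv


def join_csv_rows(path_a, path_b, path_out):
    """Append each row of path_b to the matching row of path_a."""
    with open(path_a, newline='') as fa, \
         open(path_b, newline='') as fb, \
         open(path_out, 'w', newline='') as fout:
        writer = csv.writer(fout)
        # zip() pulls one row from each reader at a time, so only one
        # pair of rows is in memory at once. It stops at the shorter file.
        for row_a, row_b in zip(csv.reader(fa), csv.reader(fb)):
            writer.writerow(row_a + row_b)
```

Calling `join_csv_rows('a.csv', 'b.csv', 'joined.csv')` would write the joined rows to `joined.csv`; if the inputs differ in length, the extra rows of the longer file are silently dropped.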

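By its nature, transposing the data in a CSV file without reading the whole thing into memory is going to be a much slower process, since it requires reading the entire input file multiple times, each time extracting the data from just one of its columns. If that is an acceptable (or necessary) trade-off, here is basically how it can be done with the built-in csv module: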

import csv

input_filename = 'input.csv'
output_filename = 'output.csv'

# newline='' lets the csv module handle line endings itself (Python 3)
with open(output_filename, 'w', newline='') as outputf:
    writer = csv.writer(outputf)
    with open(input_filename, newline='') as inputf:
        # determine number of columns in input file by counting those in its first row
        # number of cols in input file determines number of rows in output file
        numcols = len(next(csv.reader(inputf)))
        # read entire input file multiple times, extracting one column from each row
        for col_index in range(numcols):
            # write all of column data as a single row of the output file
            inputf.seek(0)  # rewind file for each pass
            writer.writerow(tuple(row[col_index] for row in csv.reader(inputf)))
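If some memory is available, the number of passes over the input can be reduced by extracting several columns per pass instead of one. The following is a sketch of that variation, not part of the original answer; the function name `transpose_csv_in_batches` and the `batch_size` parameter are assumptions made for illustration.

```python
import csv


def transpose_csv_in_batches(input_filename, output_filename, batch_size=100):
    """Transpose a CSV, extracting batch_size columns per pass over the input."""
    with open(input_filename, newline='') as inputf, \
         open(output_filename, 'w', newline='') as outputf:
        writer = csv.writer(outputf)
        # The first row tells us how many columns (future output rows) exist.
        numcols = len(next(csv.reader(inputf)))
        for start in range(0, numcols, batch_size):
            stop = min(start + batch_size, numcols)
            inputf.seek(0)  # rewind file for each pass
            # One list per column in this batch; each grows to the
            # number of rows in the input file.
            columns = [[] for _ in range(stop - start)]
            for row in csv.reader(inputf):
                for out_row, value in zip(columns, row[start:stop]):
                    out_row.append(value)
            writer.writerows(columns)
```

With `batch_size=1` this behaves like the one-column-per-pass loop; larger values hold roughly `batch_size` columns in memory at a time in exchange for fewer reads of the input.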

Answer 2

Here is a solution that works when the fields have a fixed width:

import os
import sys


def main():

    path_in = sys.argv[-1]
    path_out = os.path.basename(path_in) + '.transposed'

    # Binary mode, so that seek() offsets are plain byte counts.
    # Assumes whitespace-separated fields and single-byte '\n' line endings.
    with open(path_in, 'rb') as fd_in:
        line = fd_in.readline()
    fields = line.split()
    # Width of one field including its trailing separator (or newline)
    field_width = len(line) // len(fields)

    file_size = os.path.getsize(path_in)
    cols2 = rows1 = file_size // len(line)
    rows2 = cols1 = len(fields)

    with open(path_in, 'rb') as fd_in, open(path_out, 'wb') as fd_out:
        for row in range(rows2):
            for col in range(cols2):
                # Jump straight to field `row` of input line `col`
                fd_in.seek(col * len(line) + row * field_width)
                fd_out.write(fd_in.read(field_width - 1))
                # Space between fields, newline after the last one
                fd_out.write(b' ' if col + 1 < cols2 else b'\n')

    return


if __name__ == '__main__':
    main()

Here is a solution that works if the fields do not have a fixed width:

import os
import sys


def main():

    path_in = sys.argv[-1]
    path_out = os.path.basename(path_in) + '.transposed'
    separator = b' '

    # First pass: record the byte offset at which each input row starts.
    # Binary mode, so that offsets are plain byte counts.
    d_seek = {}
    with open(path_in, 'rb') as fd_in:
        i = 0
        while True:
            tell = fd_in.tell()
            if fd_in.readline() == b'':
                break
            d_seek[i] = tell
            i += 1
    cols2 = rows1 = i

    with open(path_in, 'rb') as fd_in:
        line = fd_in.readline()
    rows2 = cols1 = len(line.split(separator))
    del line

    with open(path_in, 'rb') as fd_in, open(path_out, 'wb') as fd_out:
        for row2 in range(rows2):
            for row1 in range(rows1):
                # Resume each input row where the previous pass stopped
                fd_in.seek(d_seek[row1])
                s = b''
                while True:
                    char = fd_in.read(1)
                    # Stop at a separator, a newline, or end of file
                    if char in (separator, b'\n', b''):
                        break
                    s += char
                # Advance this row's offset past the field and its separator
                d_seek[row1] += len(s) + 1
                if row1 + 1 < rows1:
                    fd_out.write(s + b' ')
                else:
                    fd_out.write(s + b'\n')

    return


if __name__ == '__main__':
    main()

Answer 3

Another short, Pythonic solution. I used it to transpose a 15,000,000 x 12,000 CSV. It is fast and pure Python. Everything else you need to do is trivial; this is definitely the hardest part.

GitHub link: https://gist.github.com/arose13/facfb91b609d453f3ad840417faa503a

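import csv


def transpose_csv_out_of_core(csv_path, output_csv_path='transposed.csv', delimiter=','):
    """
    On my laptop it can transpose at ~375,000 lines a sec

    :param csv_path: path of the CSV file to read
    :param output_csv_path: path of the transposed file to write
    :param delimiter: delimiter written between output fields
    :return: None
    """
    # Note: zip(*rows) consumes every input row before it yields the first
    # output row, so the parsed input is held in memory while this runs.
    with open(csv_path, newline='') as src, open(output_csv_path, 'w') as out:
        transposed_iterator = zip(*csv.reader(src))
        for row in transposed_iterator:
            out.write(delimiter.join(row) + '\n')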

Answer 4

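The following code simulates reading from two CSV files. The first has the two rows

[1,2,1]
[3,4,1]

and the second

[7,8,2]
[9,10,2]

The result is the two rows

[1,2,1,7,8,2]
[3,4,1,9,10,2]

Is that what you want?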

def source1():
    for i in [[1, 2, 1], [3, 4, 1]]:
        yield i

def source2():
    for i in [[7, 8, 2], [9, 10, 2]]:
        yield i

def join(*sources):
    # zip() takes one row from every source per step and stops
    # as soon as any source is exhausted
    for rows in zip(*sources):
        row = []
        for r in rows:
            row.extend(r)
        yield row

for row in join(source1(), source2()):
    print(row)

In your case you would have to replace the calls to source1() and source2() with iterators over the CSV files.
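One way that replacement might look, as a sketch: `csv_source` is a hypothetical generator (not from the original answer) that yields the rows of a file lazily, and `join` is written with `zip()` so that it ends cleanly when the shortest source runs out.

```python
import csv


def csv_source(path):
    """Yield the rows of a CSV file one at a time."""
    with open(path, newline='') as f:
        for row in csv.reader(f):
            yield row


def join(*sources):
    """Yield rows built by concatenating one row from every source."""
    for rows in zip(*sources):
        combined = []
        for row in rows:
            combined.extend(row)
        yield combined
```

Iterating over `join(csv_source('a.csv'), csv_source('b.csv'))` then produces one combined row at a time, and any number of sources can be passed. The values are strings, since the csv module does not convert types.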

Answer 5

Use generators, e.g.

def lazy(file):
    for line in file:
        # do something with the line
        yield line

# In Python 3 the built-in zip() is lazy, so itertools.izip is not needed
with open("test", "r") as file1, open("test2", "r") as file2:
    for lines in zip(lazy(file1), lazy(file2)):
        print(lines)

http://wiki.python.org/moin/Generators

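Edit: You can use the csv module to parse the lines. I also realized that the readlines() method of file objects is not lazy, so you have to use the `for line in file` pattern.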

5 answers

Answer 1  1  2013-03-25 18:05:08
Answer 2  0  2014-09-30 13:44:20
Answer 3  0  2017-04-08 00:42:16
Answer 4  0  2011-08-23 07:56:11
Answer 5  -1  accepted  2011-08-23 05:24:17
