Fast way to transpose and concat CSV files in Python?

I am trying to transpose multiple files of the same format and concatenate them into one big CSV file. I wanted to use numpy for the transposing because it is really fast, but it somehow drops all my headers, which I need. These are my files:

testfile1.csv
time,topic1,topic2,country
2015-10-01,20,30,usa
2015-10-02,25,35,usa

testfile2.csv
time,topic3,topic4,country
2015-10-01,40,50,uk
2015-10-02,45,55,uk

This is my code to transpose and merge all the CSV files into one big file:

from numpy import genfromtxt
import csv

file_list=['testfile1.csv','testfile2.csv']

def transpose_append(csv_file):
    my_data = genfromtxt(item, delimiter=',',skip_header=0)
    print my_data, "my_data, not transposed"
    if i == 0:
        transposed_data = my_data.T
        print transposed_data, "transposed_data"
        for row in transposed_data:
            print row, "row from first file"
            csv_writer.writerow([row])
    else:
        transposed_data = my_data.T
        for row in transposed_data:
            print row, "row from second file"
            csv_writer.writerow([row][:1])


with open("combined_transposed_file.csv", 'wb') as outputfile:
    csv_writer = csv.writer(outputfile)

for i,item in enumerate(file_list):
    transpose_append(item)

outputfile.close()

This is the output of the print statements. It shows that the transposing somewhat works, but my headers are missing:

[[ nan  nan  nan  nan]
 [ nan  20.  30.  nan]
 [ nan  25.  35.  nan]] my_data, not transposed
[[ nan  nan  nan]
 [ nan  20.  25.]
 [ nan  30.  35.]
 [ nan  nan  nan]] transposed_data

This is my expected output:

      ,2015-10-01,2015-10-02,country
topic1,20,25,usa
topic2,30,35,usa
topic3,40,45,uk
topic4,50,55,uk

There are various ways of handling headers in genfromtxt. The default is to treat them as part of the data:

In [6]: txt="""time,topic1,topic2,country
   ...: 2015-10-01,20,30,usa
   ...: 2015-10-02,25,35,usa"""

In [7]: data=np.genfromtxt(txt.splitlines(),delimiter=',',skip_header=0)

In [8]: data
Out[8]: 
array([[ nan,  nan,  nan,  nan],
       [ nan,  20.,  30.,  nan],
       [ nan,  25.,  35.,  nan]])

But since the default dtype is float, the strings all appear as nan.

You can treat them as headers instead; the result is a structured array. The headers now appear in the data.dtype.names tuple.

In [9]: data=np.genfromtxt(txt.splitlines(),delimiter=',',names=True)

In [10]: data
Out[10]: 
array([(nan, 20.0, 30.0, nan), (nan, 25.0, 35.0, nan)], 
      dtype=[('time', '<f8'), ('topic1', '<f8'), ('topic2', '<f8'), ('country', '<f8')])
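As a small illustration of the structured array above, the header names and the individual columns can be pulled out by name. This is only a sketch that reuses the same sample text; the variable names header and topic1 are made up for the example.

```python
import numpy as np

txt = """time,topic1,topic2,country
2015-10-01,20,30,usa
2015-10-02,25,35,usa"""

# names=True takes the field names from the first line.
# The dtype is still float, so the text columns remain nan.
data = np.genfromtxt(txt.splitlines(), delimiter=',', names=True)

header = data.dtype.names   # tuple of the column names
topic1 = data['topic1']     # one numeric column, selected by name
```

The headers are recovered this way, but a structured array is one-dimensional, so it is not transposed the way a plain 2-D array is.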

With dtype=None, you let it choose the dtype. Because the first line holds strings, every column is loaded as a string, here S10 (strings of up to 10 characters).

In [11]: data=np.genfromtxt(txt.splitlines(),delimiter=',',dtype=None)

In [12]: data
Out[12]: 
array([['time', 'topic1', 'topic2', 'country'],
       ['2015-10-01', '20', '30', 'usa'],
       ['2015-10-02', '25', '35', 'usa']], 
      dtype='|S10')

This matrix can be transposed, and printed or written to a CSV file:

In [13]: data.T
Out[13]: 
array([['time', '2015-10-01', '2015-10-02'],
       ['topic1', '20', '25'],
       ['topic2', '30', '35'],
       ['country', 'usa', 'usa']], 
      dtype='|S10')
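Depending on the Python and NumPy versions, dtype=None may give byte strings rather than text. The sketch below is one way around that, assuming a NumPy version whose genfromtxt accepts the encoding argument (1.14 or later): it asks for text with dtype=str and then transposes. The names transposed and rows are illustrative only.

```python
import numpy as np

txt = """time,topic1,topic2,country
2015-10-01,20,30,usa
2015-10-02,25,35,usa"""

# dtype=str keeps every field as text, so the header line survives as data.
data = np.genfromtxt(txt.splitlines(), delimiter=',', dtype=str,
                     encoding='utf-8')

# .T swaps the axes: each original column becomes one row.
transposed = data.T
rows = transposed.tolist()   # plain nested lists, convenient for csv.writer
```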

Since I'm using genfromtxt to load, I could use savetxt to save:

In [26]: with open('test.txt','w') as f:
   ....:     np.savetxt(f, data.T, delimiter=',', fmt='%12s')
   ....:     np.savetxt(f, data.T, delimiter=';', fmt='%10s') # simulate a 2nd array
   ....:     

In [27]: cat test.txt
        time,  2015-10-01,  2015-10-02
      topic1,          20,          25
      topic2,          30,          35
     country,         usa,         usa
      time;2015-10-01;2015-10-02
    topic1;        20;        25
    topic2;        30;        35
   country;       usa;       usa
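Putting the pieces together for the files in the question, a minimal sketch could look like the following. The helper name transpose_and_concat is hypothetical, and the code assumes Python 3, a NumPy version with the encoding argument, and that every file shares the same first (time) column, so that row is written only once.

```python
import csv
import numpy as np

def transpose_and_concat(file_list, out_path):
    """Transpose each CSV in file_list and append its rows to out_path."""
    with open(out_path, 'w', newline='') as out:
        writer = csv.writer(out)
        for i, path in enumerate(file_list):
            # Load everything as text so the header line is kept as data.
            data = np.genfromtxt(path, delimiter=',', dtype=str,
                                 encoding='utf-8')
            rows = data.T.tolist()
            # Assumption: the time row is identical in every file,
            # so it is written for the first file only.
            writer.writerows(rows if i == 0 else rows[1:])
```

Calling transpose_and_concat(['testfile1.csv', 'testfile2.csv'], 'combined_transposed_file.csv') writes the time row once, followed by the topic and country rows of each file. Note that this is a plain transposition: country ends up as one row per file, not as the trailing column shown in the question's expected output.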
