Parsing a .TSV file and writing the data into a new .TSV file with the columns rearranged

I want to read a TSV file (>1M rows) and write another TSV file that contains exactly the same data, but with the columns rearranged.

For example,

Original TSV file:

A   B . . . . .H
a1  b1.. . . . h1
a2  b2. . . . .h2
a3  b3. . . . .h3
.   .. . . . . . so on. 

(The first line contains the headers.)

I know how to create, read and write a file, but I don't know how to rearrange the columns.

file_location = 'abc.tsv'
output_filename = 'sample.tsv'


def main():
    file_reader = open(file_location,'r')
    new_file = open(output_filename,'w')

    for rows in file_reader:
        try:
            # Split each line into its tab-separated fields.
            rows = rows.strip().split('\t')

        except Exception as e:
            print('Error in reading file: %s' % e)

    file_reader.close()
    new_file.close()


if __name__ == '__main__':
    main()

Expected output:

D   G . . . . B
d1  g1. . . . b1
d2  g2. . . . b2
d3  g3. . . . b3
d4  g4. . . . b4
.   . . . . . .
.  .  . . . . . so on.

Any ideas are appreciated. Thank you.

As I mentioned in a comment, you could use the csv module to do this. It should also be fairly fast: there is no explicit loop over the rows or fields of the files in the code, and in CPython the csv module's underlying reader and writer are implemented in C.

For example:

import csv


file_location = 'abc.tsv'
output_filename = 'sample.tsv'
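# infields must name every column of the input file, in file order.
infields =  'A', 'B', 'C', 'D', 'G', 'H'
# outfields is the column order wanted in the output file.
outfields = 'D', 'G', 'A', 'H', 'C', 'B'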


def main():
    with open(file_location, 'r', newline='') as inp, \
         open(output_filename, 'w', newline='') as outp:

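        # Because fieldnames is passed explicitly, the header line is read
        # as an ordinary row, so it is rearranged along with the data.
        reader = csv.DictReader(inp, fieldnames=infields, delimiter='\t')
        writer = csv.DictWriter(outp, fieldnames=outfields, delimiter='\t',
                                extrasaction='ignore')

        writer.writerows(reader)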


if __name__ == '__main__':
    main()
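If you would rather not hard-code the input column names, one possible variant is to let `DictReader` take them from the header row and write the new header explicitly. This is only a minimal sketch, not part of the answer's code: `reorder_tsv` is a hypothetical helper name, the in-memory sample data is made up for illustration, and it assumes every name in `outfields` appears in the file's header.

```python
import csv
import io


def reorder_tsv(inp, outp, outfields):
    """Copy TSV rows from inp to outp, keeping only outfields, in that order."""
    # With no fieldnames argument, DictReader uses the first row as the header.
    reader = csv.DictReader(inp, delimiter='\t')
    # lineterminator='\n' is a choice made for this demo; the csv default is '\r\n'.
    writer = csv.DictWriter(outp, fieldnames=outfields, delimiter='\t',
                            extrasaction='ignore', lineterminator='\n')
    writer.writeheader()      # header row in the new column order
    writer.writerows(reader)  # rows are streamed, not loaded all at once


# Small in-memory demonstration; real code would pass file objects
# opened with newline='' instead of StringIO.
sample = io.StringIO('A\tB\tC\na1\tb1\tc1\na2\tb2\tc2\n')
result = io.StringIO()
reorder_tsv(sample, result, ['C', 'A', 'B'])
print(result.getvalue())
```

The trade-off is that the header is now consumed by the reader, so the writer has to emit it with `writeheader()`.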

You can do this easily with pandas: read the file into a pandas DataFrame, reorder the DataFrame's columns as you want, and then write it back to a TSV file.

To read the file into a pandas DataFrame, use:

import pandas as pd    
df = pd.read_csv("abc.tsv", sep='\t', header=0)

You can learn about the basics of pandas here.

Something like this:

(I did not change the positions of the table headers.)
I have also skipped reading from and writing to files, since I assume that is not a challenge for you.
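The reordering and writing steps described above could look roughly like the following. This is a sketch under assumptions: the three-column sample text and the `new_order` list are made up for illustration, and `io.StringIO` stands in for the real input and output files.

```python
import io

import pandas as pd

# Stand-in for the real file; with a file on disk you would pass its path.
tsv_text = 'A\tB\tC\na1\tb1\tc1\na2\tb2\tc2\n'
df = pd.read_csv(io.StringIO(tsv_text), sep='\t', header=0)

# Selecting with a list of column names returns the columns in that order.
new_order = ['C', 'A', 'B']   # hypothetical target order
reordered = df[new_order]

# index=False keeps pandas from writing the row index as an extra column.
out = io.StringIO()
reordered.to_csv(out, sep='\t', index=False)
print(out.getvalue())
```

With a real file you would pass the output path (for example `'sample.tsv'`) to `to_csv` instead of the `StringIO` object.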

original_data = [['A','B','C'],['a1','b1','c1'],['a2','b2','c2']]

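def switch_columns(column_pairs,entries):
  # Swap each pair of column indices in place, in every row except the header.
  for pair in column_pairs:
    for idx,entry in enumerate(entries):
      if idx > 0:  # idx 0 is the header row, which is left untouched
        temp = entry[pair[0]]
        entry[pair[0]] = entry[pair[1]]
        entry[pair[1]] = temp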

print('Before:')
print(original_data)
switch_columns([(0,2)],original_data)
print('After:')
print(original_data)

Output:

Before:
[['A', 'B', 'C'], ['a1', 'b1', 'c1'], ['a2', 'b2', 'c2']]
After:
[['A', 'B', 'C'], ['c1', 'b1', 'a1'], ['c2', 'b2', 'a2']]
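If the headers should move together with the data, as in the expected output of the question, one alternative to pairwise swapping is to build each row from a list of column indices. The following is a minimal sketch rather than the code above: `reorder_columns` is a hypothetical name, and it returns new rows instead of modifying the input in place.

```python
original_data = [['A', 'B', 'C'], ['a1', 'b1', 'c1'], ['a2', 'b2', 'c2']]


def reorder_columns(order, entries):
    """Return new rows whose columns follow the given index order."""
    return [[entry[i] for i in order] for entry in entries]


# Column 2 first, then 0, then 1; the header row is reordered too.
reordered = reorder_columns([2, 0, 1], original_data)
print(reordered)
```

Leaving an index out of `order` drops that column, and repeating one duplicates it.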
