
Split a multi-line text file into a multi-line CSV file

I have a text file that contains data in the following form:

100157  100157
100157  364207
100157  38848
100157  bradshaw97introduction
100157  bylund99coordinating
100157  dix01metaagent
100157  gray99finding
...
...

I'm trying to convert this into a dataset that scikit-learn can read, using the following code:

datafile = open('filename.txt', 'r')
data = []
for row in datafile:
    data.append(row.strip().split('\t'))

c = open('filename.csv', 'w')
arr = str(data)
c.write(arr)
c.close()

However, after executing this code, the data is written out as a single row, whereas I want it neatly separated into rows and columns in CSV format, like the Iris dataset.

Could I get some help as to how I should proceed? Thanks.

Use the csv module:

import csv

with open('filename.txt', 'r') as f, open('filename.csv', 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerows(line.rstrip().split('\t') for line in f)

Output CSV file:

100157,100157
100157,364207
100157,38848
100157,bradshaw97introduction
100157,bylund99coordinating
100157,dix01metaagent
100157,gray99finding
...
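
As a variation on the snippet above, the csv module can also do the tab splitting itself through csv.reader with delimiter='\t'. This is only a minimal sketch assuming Python 3; the function name tsv_to_csv and the skipping of blank lines are illustrative choices, not part of the original answer.

```python
import csv

def tsv_to_csv(src_path, dst_path):
    """Copy tab-separated rows from src_path into a comma-separated file."""
    # newline='' lets the csv module handle line endings itself (Python 3).
    with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
        reader = csv.reader(src, delimiter='\t')
        writer = csv.writer(dst)
        # csv.reader yields an empty list for a blank line; skip those.
        writer.writerows(row for row in reader if row)
```

Calling tsv_to_csv('filename.txt', 'filename.csv') should then produce one CSV row per input line, with the writer quoting any field that happens to contain a comma.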

Correct me if I'm wrong, but I think a scikit-readable dataset is just space-separated values, with \n separating the rows?

If so, this is quite easy:

Assume you have this file:

100157  100157
100157  364207
100157  38848
100157  bradshaw97introduction
100157  bylund99coordinating
100157  dix01metaagent
100157  gray99finding

The columns are separated by tabs.

You can easily turn that into space-separated, newline-delimited values:

with open('/tmp/test.csv', 'r') as fin, open('/tmp/test.out', 'w') as fout:
    data=[row.strip().split('\t') for row in fin]
    st='\n'.join(' '.join(e) for e in data)
    fout.write(st)

print(data)
# [['100157', '100157'], ['100157', '364207'], ['100157', '38848'], ['100157', 'bradshaw97introduction'], ['100157', 'bylund99coordinating'], ['100157', 'dix01metaagent'], ['100157', 'gray99finding']]
print(st)
100157 100157
100157 364207
100157 38848
100157 bradshaw97introduction
100157 bylund99coordinating
100157 dix01metaagent
100157 gray99finding
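
If it is not certain whether the input columns are separated by tabs or by runs of spaces, str.split() with no argument handles both. The sketch below assumes that no field contains whitespace itself; the helper name to_space_separated and the sample list are made up for illustration.

```python
def to_space_separated(lines):
    """Join the whitespace-separated fields of each non-blank line with one space."""
    # str.split() with no argument splits on any run of spaces or tabs
    # and drops the trailing newline as well.
    return '\n'.join(' '.join(line.split()) for line in lines if line.strip())

# Hypothetical sample mixing a tab, two spaces and a blank line.
sample = ["100157\t100157\n", "100157  364207\n", "\n", "100157\tdix01metaagent\n"]
print(to_space_separated(sample))
```

Because the helper accepts any iterable of lines, an open file object can be passed to it directly.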

