How to split a text file by id in Python

I have a bunch of text files containing tab-separated tables. The second column contains an id number, and each file is already sorted by that id number. I want to separate each file into multiple files by the id number in column 2. Here's what I have.

readpath = 'path-to-read-file'
writepath = 'path-to-write-file'
for filename in os.listdir(readpath):
     with open(readpath+filename, 'r') as fh:
          lines = fh.readlines()
     lastid = 0
     f = open(writepath+'checkme.txt', 'w')
     f.write(filename)
     for line in lines:
          thisid = line.split("\t")[1]
          if int(thisid) <> lastid:
               f.close()
               f = open(writepath+thisid+'-'+filename,'w')
               lastid = int(thisid)
          f.write(line)
     f.close()

What I get is simply a copy of each input file, with the first id number from that file prepended to the new filename. It is as if

thisid = line.split("\t")[1]

is only done once in the loop. Any clue as to what is going on?

EDIT

The problem was that my files used \r rather than \r\n to terminate lines, so each file was read as a single line. Corrected code (simply adding 'rU' when opening the read file and replacing <> with !=):

readpath = 'path-to-read-file'
writepath = 'path-to-write-file'
for filename in os.listdir(readpath):
     with open(readpath+filename, 'rU') as fh:
          lines = fh.readlines()
     lastid = 0
     f = open(writepath+'checkme.txt', 'w')
     f.write(filename)
     for line in lines:
          thisid = line.split("\t")[1]
          if int(thisid) != lastid:
               f.close()
               f = open(writepath+thisid+'-'+filename,'w')
               lastid = int(thisid)
          f.write(line)
     f.close()
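
For anyone reproducing this on Python 3: the 'U' flag was removed in Python 3.11 and `<>` is no longer valid syntax, but text mode already uses universal newlines by default (`newline=None`). The sketch below is only an illustration of the diagnosis, not part of the original fix. The sample data and the helper name `count_lines` are made up for the demonstration; it reads the same \r-terminated file with and without newline translation.

```python
import os
import tempfile

def count_lines(path, newline):
    """Return how many lines readlines() sees for a given newline setting."""
    # newline=None  -> universal newlines: '\r', '\n' and '\r\n' all end a line
    # newline='\n'  -> only '\n' ends a line, so a bare '\r' is ordinary text
    with open(path, 'r', newline=newline) as fh:
        return len(fh.readlines())

# Hypothetical sample: three tab-separated rows terminated by bare '\r'.
sample = "a\t1\tx\rb\t1\ty\rc\t2\tz\r"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    # newline='' on write leaves the characters exactly as given
    with open(path, 'w', newline='') as fh:
        fh.write(sample)
    no_translation = count_lines(path, '\n')  # whole file seen as one line
    universal = count_lines(path, None)       # one line per row
```

With translation disabled the loop body runs once per file, which matches the symptom described above: only the first id is ever read, and the whole file is copied under that id.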

If you're dealing with tab-delimited files, you can use the csv module and take advantage of the fact that itertools.groupby will do the previous/current tracking of the id for you. Also use os.path.join to make sure your filenames end up joined correctly.

Untested:

import os
import csv
from itertools import groupby

readpath = 'path-to-read-file'
writepath = 'path-to-write-file'

for filename in os.listdir(readpath):
    with open(os.path.join(readpath, filename)) as fin:
        tabin = csv.reader(fin, delimiter='\t')
        for file_id, rows in groupby(tabin, lambda L: L[1]):
            with open(os.path.join(writepath, file_id + '-' + filename), 'w') as fout:
                tabout = csv.writer(fout, delimiter='\t')
                tabout.writerows(rows)
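
A Python 3 variant of the same idea, as a rough sketch rather than a drop-in replacement. The function name `split_file_by_id`, the sample rows, and the choice of `lineterminator='\n'` are assumptions made for this example. It also assumes every row has at least two columns (a blank line would raise IndexError) and that the input is already sorted by id, as stated in the question. Opening with `newline=''` lets the csv module deal with line endings itself.

```python
import csv
import os
import tempfile
from itertools import groupby

def split_file_by_id(src_path, out_dir):
    """Write one file per run of equal ids in column 2; return the new paths."""
    filename = os.path.basename(src_path)
    written = []
    # newline='' hands raw line endings to the csv module
    with open(src_path, newline='') as fin:
        tabin = csv.reader(fin, delimiter='\t')
        # groupby yields (id, rows) for each consecutive run of the same id
        for file_id, rows in groupby(tabin, key=lambda row: row[1]):
            out_path = os.path.join(out_dir, file_id + '-' + filename)
            with open(out_path, 'w', newline='') as fout:
                tabout = csv.writer(fout, delimiter='\t', lineterminator='\n')
                tabout.writerows(rows)
            written.append(out_path)
    return written

# Small self-check on made-up data.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'table.txt')
    with open(src, 'w', newline='') as fh:
        fh.write("a\t1\tx\nb\t1\ty\nc\t2\tz\n")
    outputs = [os.path.basename(p) for p in split_file_by_id(src, tmp)]
    with open(os.path.join(tmp, '1-table.txt'), newline='') as fh:
        first_group = fh.read()
```

Because groupby only groups consecutive rows, an unsorted input would reopen the same output file in 'w' mode and overwrite the earlier rows for that id.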

 