
Merge multiple files by reading them simultaneously line by line?

I have 3 files:

file1:

    chrM    6423    5
    chrM    6432    4
    chrM    7575    1
    chrM    7670    1
    chrM    7933    1
    chrM    7984    1
    chrM    8123    1
    chrM    9944    1
    chrM    10434   1
    chrM    10998   13
    chrM    10999   19
    chrM    11024   17
    chrM    11025   29
    chrM    11117   21
    chrM    11118   42
    chr1    197095350   2
    chr1    197103061   1
    chr1    197103582   1
    chr1    197103615   1
    chr1    197103810   3
    chr1    197103885   2
    chr1    197104256   1
    chr1    197107467   4
    chr1    197107480   5
    chr1    197107498   6
    chr1    197107528   10
    chr1    197107805   1
    chr1    197107806   1
    chr1    197107813   1
    chr1    197107814   1
    chr1    197107839   1
    chr1    197107840   1
    chr1    197107855   1
    chr1    197107856   1
    chr1    197107877   1
    chr1    197107878   1
    chr1    197111511   1
    chr1    197120122   1
    chr1    197125503   1
    chr1    197126978   1
    chr1    197127070   1
    chr1    197127084   1
    chr1    197129731   2
    chr1    197129758   2
    chr1    197129765   1
    chr1    197167632   2
    chr1    197167652   2
    chr1    197167668   2
    chr1    197167682   2
    chr1    197181417   1
    chr1    197181973   3
    chr1    197181975   3
    chr1    197192150   0

file2:

    chrM    6423    5
    chrM    6432    4
    chrM    6582    1
    chrM    6640    1
    chrM    6643    1
    chrM    7140    1
    chrM    10998   7
    chrM    10999   8
    chrM    11024   10
    chrM    11025   13
    chrM    11117   12
    chrM    11118   33
    chr1    197095157   2
    chr1    197095185   2
    chr1    197098860   1
    chr1    197105061   1
    chr1    197107422   1
    chr1    197107436   1
    chr1    197107467   3
    chr1    197107480   4
    chr1    197107498   3
    chr1    197107528   4
    chr1    197107805   2
    chr1    197107813   2
    chr1    197107839   1
    chr1    197108557   1
    chr1    197108591   1
    chr1    197108596   1
    chr1    197108617   1
    chr1    197108651   1
    chr1    197139308   1
    chr1    197139335   1
    chr1    197143403   1
    chr1    197143442   1
    chr1    197145546   1
    chr1    197148715   1
    chr1    197148723   1
    chr1    197148731   1
    chr1    197148761   1
    chr1    197153190   1
    chr1    197166831   1
    chr1    197166847   2
    chr1    197166922   2
    chr1    197166950   1
    chr1    197166954   1
    chr1    197167041   1
    chr1    197167778   1
    chr1    197167791   1
    chr1    197167834   1
    chr1    197167857   2
    chr1    197167860   2
    chr1    197167865   1
    chr1    197167867   1
    chr1    197167871   1
    chr1    197167935   2
    chr1    197167946   2
    chr1    197167948   2
    chr1    197167951   2
    chr1    197167974   1
    chr1    197167980   1
    chr1    197168142   1
    chr1    197168163   1
    chr1    197168195   1
    chr1    197168210   1
    chr1    197169548   1
    chr1    197169580   1
    chr1    197169609   1
    chr1    197183318   1
    chr1    197183404   1
    chr1    197184910   1
    chr1    197184937   1
    chr1    197186368   1
    chr1    197191991   1
    chr1    197192031   1
    chr1    197192047   1
    chr1    197192097   1
    chr1    197192106   1
    chr1    197192125   1
    chr1    197192150   1

file3:

    chrM    6423    2
    chrM    6432    1
    chrM    6766    1
    chrM    6785    1
    chrM    10075   1
    chrM    10084   1
    chrM    10998   7
    chrM    10999   8
    chrM    11024   7
    chrM    11025   14
    chrM    11117   8
    chr1    197095943   1
    chr1    197096144   1
    chr1    197104061   1
    chr1    197104257   1
    chr1    197107805   2
    chr1    197122470   1
    chr1    197123085   1
    chr1    197123093   1
    chr1    197126978   1
    chr1    197142562   1
    chr1    197157076   1
    chr1    197157101   2
    chr1    197162035   4
    chr1    197167431   1
    chr1    197167470   1
    chr1    197167535   1
    chr1    197167652   1
    chr1    197167668   1
    chr1    197167682   1
    chr1    197167715   1
    chr1    197167734   1
    chr1    197167755   1
    chr1    197168107   2
    chr1    197168113   2
    chr1    197172198   1
    chr1    197172211   1
    chr1    197172221   1
    chr1    197172271   1
    chr1    197175787   1
    chr1    197175806   1
    chr1    197175822   1
    chr1    197192150   0

The resulting file should look like this:

    6423    chrM    2   5   5
    6432    chrM    1   4   4
    6582    chrM    1
    197093370   chr1    1
    197093385   chr1    1
    197094791   chr1    1
    197094813   chr1    1
    197094855   chr1    1
    197094857   chr1    1
    197095157   chr1    2
    197095185   chr1    2
    197095350   chr1    2
    197095943   chr1    1
    197096

My code is now mostly working, but there is an issue in the while loop: after merging many records, near the end of the merged file it stops writing, outputs just 197096 ...., and exits with this error:

    Traceback (most recent call last):
      File "", line 4, in <module>
    IndexError: list index out of range
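The likely cause of this traceback (my own diagnosis from the code below, stated as an assumption): readline() returns an empty string once a file is exhausted, so split() produces an empty list and indexing it fails. A minimal reproduction:

line = ''             # what readline() returns at end of file
parts = line.split()  # -> [] (an empty list)
parts[0]              # -> IndexError: list index out of range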

I think this error is related to the while loop, but I don't know why it happens. I have also been changing my code, as you can see below.

Here is where the problem shows up: you can see clearly in the resulting file that at some point, after reading from the individual files, the code is no longer able to read the common values from all files; for example, it does not output 7575, which should come right after 7140.

I have multiple large files, and I want to read them all line by line and merge lines that have the same value in column 2. For this I used the logic of collecting the column-2 values of the current lines into a list and finding the smallest of them. I then write the records with that smallest value (their column 3, saved in mycover) to a new file, keep track of which files were just consumed in my_newfile[] so their next line can be read, and delete the entries that have already been written out.
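In essence, the selection step described above reduces to the following (a condensed illustration with made-up values; see the full code further below):

mylist = [6423, 6423, 6766]  # current column-2 value from each open file
minval = min(mylist)         # smallest position across the files -> 6423
ind = [i for i, v in enumerate(mylist) if v == minval]  # files holding it -> [0, 1]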

I hope this is sufficient to understand the problem. I don't know how to repeat the process until all files have reached their end, so that all the records from all files are read. My code is as follows:

import sys
import glob
import errno
path = '*Sorted_Coverage.txt'   
filenames = glob.glob(path)  
files = [open(i, "r") for i in filenames]

p=1
mylist=[]
mychr=[]
mycover=[]
new_mychr=[]
new_mycover=[]
new_mylist=[]
myfile=[]
new_myfile=[]
ab=""
g=1
result_f = open('MERGING_water_onlyselected.txt', 'a')
for j in files: 
    line = j.readline()
    parts = line.split()
    mychr.append(parts[0])
    mycover.append(parts[2])
    mylist.append(parts[1])
    myfile.append(j)
mylist=map(int,mylist)
minval = min(mylist)
ind = [i for i, v in enumerate(mylist) if v == minval]
not_ind = [i for i, v in enumerate(mylist) if v != minval]
w=""
j=0
for j in xrange(len(ind)):  # writing  records to file with minimum value
    if(j==0):
        ab = (str(mylist[ind[j]])+'\t'+mychr[ind[j]]+'\t'+mycover[ind[j]])
    else:
        ab=ab+'\t'+mycover[ind[j]]

# smallest value written to the file

result_f.writelines(ab+'\n')
ab=""

for i in ind:
    new_myfile.append(myfile[i])

# removing the records (by index) that have been used from the lists
for i in sorted(ind, reverse=True):
    del mylist[i]
    del mycover[i]
    del mychr[i]
    del myfile[i]


# how can I iterate the following code over all records of all files until the end of each file?
while(True):
    for i in xrange(len(new_myfile)):
        print len(new_myfile)       
        myfile.append(new_myfile[i])
        line = new_myfile[i].readline()
        parts = line.split()
        mychr.append(parts[0])
        mycover.append(parts[2])
        mylist.append(parts[1])
        new_myfile=[]
    mylist=map(int, mylist)
    minval = min(mylist)
    print minval
    print("list values:")
    print mylist
    ind = [i for i, v in enumerate(mylist) if v == minval]
    not_ind = [i for i, v in enumerate(mylist) if v !=  minval]
    k=0
    ab=""
    for j in xrange(len(ind)):  # writing  records to file with minimum value
        if(j==0):
            ab = (str(mylist[ind[j]])+'\t'+str(mychr[ind[j]])+'\t'+str(mycover[ind[j]]))
            k=k+1
        else:
            ab=ab+'\t'+str(mycover[ind[j]])
            k=k+1
    # smallest value written to the file
    result_f.writelines(ab+'\n')
    ab=""
    for i in ind:
        new_myfile.append(myfile[i])
    # removing the records (by index) that have been used from the lists
    for i in sorted(ind, reverse=True):
        del mylist[i]
        del mycover[i]
        del mychr[i]
        del myfile[i]
result_f.close()

I've been searching for a solution for many days but still could not find one. I have no idea whether this code can be improved further, as I'm quite new to Python.

If anyone could help, I would be highly grateful.

Basic solution

This is quite a simple approach. I don't know how it may perform on large files (see my comments below).

I assume that all files are already sorted with respect to the second column. I also assume that the first-column signature ('chrM', 'chr1') remains the same for a fixed value of the 2nd column (I'll call this column 'id' below).

The algorithm is straightforward:

  1. read one line from each file (I call the read lines 'items')

  2. choose an 'item' with the smallest 'id' (any one) and compare it with 'current_item': if both have the same id, combine them; else, write 'current_item' to the file and replace it with 'item'

  3. read one line from the same file that 'item' was read from (if any lines are left)

  4. repeat from 1. until all lines from all files are read.


import glob
import numpy as np

path = './file[0-9]*'
filenames = glob.glob(path) 
files = [open(i, "r") for i in filenames] 
output_file = open('output_file', mode = 'a')

# last_ids[i] = last id number read from files[i]
# I choose np.array because of function np.argmin
last_ids = np.ones(shape = len(files)) * np.inf
last_items = [None] *len(files)

# Note: When we hit EOF in a file, the corresponding entries from "files", "last_items", and "last_ids" will be deleted

for i in range(len(files)):
    line = files[i].readline()
    if line:
        item = line.strip().split()
        last_ids[i] = int(item[1])
        last_items[i] = item

# Find an item with the smallest id 
pos = np.argmin(last_ids)
current_item = last_items[pos]
# Swap the first two fields so that the id comes first
current_item[0], current_item[1] = current_item[1], current_item[0]  

while True:    
    # Read next item from the corresponding file
    line = files[pos].readline()
    if line:
        item = line.strip().split()
        last_ids[pos] = int(item[1])
        last_items[pos] = item
    else:
        # EOF in files[pos], so delete it from the lists
        files[pos].close()
        del files[pos]
        del last_items[pos]
        last_ids = np.delete(last_ids, pos)
        if last_ids.size == 0:
            # No more files to read from
            break 

    # Find an item with the smallest id 
    pos = np.argmin(last_ids)
    if last_items[pos][1] == current_item[0]:
        # combine:
        current_item.append(last_items[pos][2])
    else:
        # write current to file and replace:
        output_file.write(' '.join(current_item) + '\n')
        current_item = last_items[pos]
        current_item[0], current_item[1] = current_item[1], current_item[0]  

# The last item to write:
output_file.write(' '.join(current_item) + '\n')
output_file.close()
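As a side note (not part of the original answer): under the same sortedness assumption, the standard library's heapq.merge can do this k-way merge directly, combined with itertools.groupby to collect equal ids. A minimal sketch; the output filename here is arbitrary:

import glob
import heapq
from itertools import groupby

def records(f):
    # Yield (id, chr, coverage) tuples from one sorted file.
    for line in f:
        parts = line.split()
        if not parts:          # skip blank lines, if any
            continue
        chrom, pos, cover = parts
        yield (int(pos), chrom, cover)

files = [open(name) for name in glob.glob('./file[0-9]*')]
# heapq.merge lazily merges the already-sorted streams by their first field (the id)
merged = heapq.merge(*[records(f) for f in files])
with open('output_heapq', 'w') as out:
    for pos, group in groupby(merged, key=lambda rec: rec[0]):
        rows = list(group)
        # id first, then the chr signature, then one coverage value per matching file
        out.write('\t'.join([str(pos), rows[0][1]] + [r[2] for r in rows]) + '\n')
for f in files:
    f.close()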

Small files solution:

If all the files are small enough to fit into memory, then the following code is definitely shorter. Whether it's faster may depend on the data (see the comments below).

import glob 
import pandas as pd

path = './file[0-9]*'    
filenames = glob.glob(path) 

df_list = []
# Read in all files and concatenate to a single data frame:
for file in filenames:
    df_list.append(pd.read_csv(file, header = None, sep = r'\s+'))
df = pd.concat(df_list)

# changing type for convenience:
df[2] = df[2].astype(str)
# sorting here is not necessary:
# df = df.sort_values(by = 1)

df2 = df.groupby(by = 1).aggregate({0:'first', 2: lambda x: ' '.join(x)})
df2.to_csv('output_file', header = None)
# (Columns in 'output_file' are separated by commas. )

Comments

I tested both solutions on several input files with 1000-10000 lines. Usually the basic solution is faster (sometimes twice as fast as the other one), but it depends on the structure of the data. If there are many repeated 'id's, then pandas might be slightly more advantageous (by quite a small margin).

I think both approaches could be combined with pd.read_csv using the chunksize or iterator options. That way we could read in and operate on larger chunks of data (not single lines), but I'm not sure whether that would lead to much faster code. A rough sketch of the idea is shown below.
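For illustration, the chunked-reading idea might start out like this (the chunk size is arbitrary, and aligning ids across chunk boundaries between files, the non-trivial part, is deliberately left open):

import glob
import pandas as pd

for name in glob.glob('./file[0-9]*'):
    # Iterate over DataFrames of up to 100000 rows instead of loading the whole file
    for chunk in pd.read_csv(name, header=None, sep=r'\s+', chunksize=100000):
        print(chunk.shape)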

If that fails (and if nobody finds a better way), you may consider running a map-reduce algorithm on Amazon Web Services. There is some setup work at the beginning, but a map-reduce algorithm is very straightforward for this kind of problem.
