Merge multiple files by reading them simultaneously line by line?
I have 3 files:
file1:
chrM 6423 5
chrM 6432 4
chrM 7575 1
chrM 7670 1
chrM 7933 1
chrM 7984 1
chrM 8123 1
chrM 9944 1
chrM 10434 1
chrM 10998 13
chrM 10999 19
chrM 11024 17
chrM 11025 29
chrM 11117 21
chrM 11118 42
chr1 197095350 2
chr1 197103061 1
chr1 197103582 1
chr1 197103615 1
chr1 197103810 3
chr1 197103885 2
chr1 197104256 1
chr1 197107467 4
chr1 197107480 5
chr1 197107498 6
chr1 197107528 10
chr1 197107805 1
chr1 197107806 1
chr1 197107813 1
chr1 197107814 1
chr1 197107839 1
chr1 197107840 1
chr1 197107855 1
chr1 197107856 1
chr1 197107877 1
chr1 197107878 1
chr1 197111511 1
chr1 197120122 1
chr1 197125503 1
chr1 197126978 1
chr1 197127070 1
chr1 197127084 1
chr1 197129731 2
chr1 197129758 2
chr1 197129765 1
chr1 197167632 2
chr1 197167652 2
chr1 197167668 2
chr1 197167682 2
chr1 197181417 1
chr1 197181973 3
chr1 197181975 3
chr1 197192150 0
file2:
chrM 6423 5
chrM 6432 4
chrM 6582 1
chrM 6640 1
chrM 6643 1
chrM 7140 1
chrM 10998 7
chrM 10999 8
chrM 11024 10
chrM 11025 13
chrM 11117 12
chrM 11118 33
chr1 197095157 2
chr1 197095185 2
chr1 197098860 1
chr1 197105061 1
chr1 197107422 1
chr1 197107436 1
chr1 197107467 3
chr1 197107480 4
chr1 197107498 3
chr1 197107528 4
chr1 197107805 2
chr1 197107813 2
chr1 197107839 1
chr1 197108557 1
chr1 197108591 1
chr1 197108596 1
chr1 197108617 1
chr1 197108651 1
chr1 197139308 1
chr1 197139335 1
chr1 197143403 1
chr1 197143442 1
chr1 197145546 1
chr1 197148715 1
chr1 197148723 1
chr1 197148731 1
chr1 197148761 1
chr1 197153190 1
chr1 197166831 1
chr1 197166847 2
chr1 197166922 2
chr1 197166950 1
chr1 197166954 1
chr1 197167041 1
chr1 197167778 1
chr1 197167791 1
chr1 197167834 1
chr1 197167857 2
chr1 197167860 2
chr1 197167865 1
chr1 197167867 1
chr1 197167871 1
chr1 197167935 2
chr1 197167946 2
chr1 197167948 2
chr1 197167951 2
chr1 197167974 1
chr1 197167980 1
chr1 197168142 1
chr1 197168163 1
chr1 197168195 1
chr1 197168210 1
chr1 197169548 1
chr1 197169580 1
chr1 197169609 1
chr1 197183318 1
chr1 197183404 1
chr1 197184910 1
chr1 197184937 1
chr1 197186368 1
chr1 197191991 1
chr1 197192031 1
chr1 197192047 1
chr1 197192097 1
chr1 197192106 1
chr1 197192125 1
chr1 197192150 1
file3:
chrM 6423 2
chrM 6432 1
chrM 6766 1
chrM 6785 1
chrM 10075 1
chrM 10084 1
chrM 10998 7
chrM 10999 8
chrM 11024 7
chrM 11025 14
chrM 11117 8
chr1 197095943 1
chr1 197096144 1
chr1 197104061 1
chr1 197104257 1
chr1 197107805 2
chr1 197122470 1
chr1 197123085 1
chr1 197123093 1
chr1 197126978 1
chr1 197142562 1
chr1 197157076 1
chr1 197157101 2
chr1 197162035 4
chr1 197167431 1
chr1 197167470 1
chr1 197167535 1
chr1 197167652 1
chr1 197167668 1
chr1 197167682 1
chr1 197167715 1
chr1 197167734 1
chr1 197167755 1
chr1 197168107 2
chr1 197168113 2
chr1 197172198 1
chr1 197172211 1
chr1 197172221 1
chr1 197172271 1
chr1 197175787 1
chr1 197175806 1
chr1 197175822 1
chr1 197192150 0
The resulting file should look like this:
6423 chrM 2 5 5
6432 chrM 1 4 4
6582 chrM 1
197093370 chr1 1
197093385 chr1 1
197094791 chr1 1
197094813 chr1 1
197094855 chr1 1
197094857 chr1 1
197095157 chr1 2
197095185 chr1 2
197095350 chr1 2
197095943 chr1 1
197096
Now my code is mostly working, but there is an issue in the while loop: after merging many records, near the end of the merged file it stops writing, only outputs 197096 ...., and fails with this error:

Traceback (most recent call last):
  File "", line 4, in
IndexError: list index out of range
I think this error is related to the while loop. I don't know why it's happening. I am also changing my code, as you can see below.

Here comes the problem: you can see clearly in the resulting file that at this point something goes wrong: after reading from the individual files, the code is no longer able to read the common values from all files, and it also never outputs 7575, which should come right after 7140.

I have multiple large files, and I want to read them all line by line and merge records that have the same value in column 2. My logic is to collect all the column-2 values in a list and find the smallest of them, then write the records with that smallest value (column 3 is saved in mycover) to a new file. I keep track of the files that were just read in my_newfile[], so the next line can be read from them, and I delete the records that have already been written to the file.

I hope this is sufficient to understand. What I don't know is how to repeat the process until all the files reach their end, so that all the records from all files are read. My code is as follows:
import sys
import glob
import errno

path = '*Sorted_Coverage.txt'
filenames = glob.glob(path)
files = [open(i, "r") for i in filenames]
p=1
mylist=[]
mychr=[]
mycover=[]
new_mychr=[]
new_mycover=[]
new_mylist=[]
myfile=[]
new_myfile=[]
ab=""
g=1
result_f = open('MERGING_water_onlyselected.txt', 'a')

for j in files:
    line = j.readline()
    parts = line.split()
    mychr.append(parts[0])
    mycover.append(parts[2])
    mylist.append(parts[1])
    myfile.append(j)

mylist = map(int, mylist)
minval = min(mylist)
ind = [i for i, v in enumerate(mylist) if v == minval]
not_ind = [i for i, v in enumerate(mylist) if v != minval]
w=""
j=0
for j in xrange(len(ind)):  # writing records to file with minimum value
    if j == 0:
        ab = (str(mylist[ind[j]])+'\t'+mychr[ind[j]]+'\t'+mycover[ind[j]])
    else:
        ab = ab+'\t'+mycover[ind[j]]
# smallest written on file
result_f.writelines(ab+'\n')
ab=""
for i in ind:
    new_myfile.append(myfile[i])
# removing the records by index which have been used from mylists
for i in sorted(ind, reverse=True):
    del mylist[i]
    del mycover[i]
    del mychr[i]
    del myfile[i]

# how to iterate the following code over all records of all files till the end of each file
while True:
    for i in xrange(len(new_myfile)):
        print len(new_myfile)
        myfile.append(new_myfile[i])
        line = new_myfile[i].readline()
        parts = line.split()
        mychr.append(parts[0])
        mycover.append(parts[2])
        mylist.append(parts[1])
    new_myfile=[]
    mylist = map(int, mylist)
    minval = min(mylist)
    print minval
    print("list values:")
    print mylist
    ind = [i for i, v in enumerate(mylist) if v == minval]
    not_ind = [i for i, v in enumerate(mylist) if v != minval]
    k=0
    ab=""
    for j in xrange(len(ind)):  # writing records to file with minimum value
        if j == 0:
            ab = (str(mylist[ind[j]])+'\t'+str(mychr[ind[j]])+'\t'+str(mycover[ind[j]]))
            k=k+1
        else:
            ab = ab+'\t'+str(mycover[ind[j]])
            k=k+1
    # smallest written on file
    result_f.writelines(ab+'\n')
    ab=""
    for i in ind:
        new_myfile.append(myfile[i])
    # removing the records by index which have been used from mylists
    for i in sorted(ind, reverse=True):
        del mylist[i]
        del mycover[i]
        del mychr[i]
        del myfile[i]

result_f.close()
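For context on the traceback: `readline()` returns an empty string at end of file, and `''.split()` is an empty list, so `parts[0]` raises IndexError. The sketch below illustrates this failure mode and a guard; `read_record` is a hypothetical helper of my own, not part of the code above.

```python
import io

def read_record(f):
    """Read one 'chrom pos cover' line; return None at end of file."""
    line = f.readline()
    if not line:          # '' signals EOF -- stop reading from this file
        return None
    chrom, pos, cover = line.split()
    return chrom, int(pos), cover

demo = io.StringIO("chrM 6423 5\n")
print(read_record(demo))   # ('chrM', 6423, '5')
print(read_record(demo))   # None: EOF reached, no IndexError
```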
I've been searching for a solution for many days but still could not find one. I have no idea whether this code could be improved further, as I'm quite new to Python.

If anyone could help, I would be highly grateful.
This is quite a simple approach. I don't know how it may perform on large files (see my comments below).

I assume that all files are already sorted with respect to the second column. Also, I assume that the first-column signature ('chrM', 'chr1') stays the same for a fixed value in the 2nd column (I'll call this column 'id' below).
The algorithm is straightforward. First, read one line from each file (I call the read lines 'items'). Then:

1. Choose an 'item' with the smallest 'id' (any one) and compare it with 'current_item': if both have the same id, combine them; otherwise, write 'current_item' to the file and replace it with 'item'.
2. Read one line from the same file that 'item' was read from (if any lines are left).
3. Repeat from 1. until all lines from all files are read.
import glob
import numpy as np

path = './file[0-9]*'
filenames = glob.glob(path)
files = [open(i, "r") for i in filenames]
output_file = open('output_file', mode='a')

# last_ids[i] = last id number read from files[i]
# I choose np.array because of function np.argmin
last_ids = np.ones(shape=len(files)) * np.inf
last_items = [None] * len(files)

# Note: When we hit EOF in a file, the corresponding entries from
# "files", "last_items", and "last_ids" will be deleted
for i in range(len(files)):
    line = files[i].readline()
    if line:
        item = line.strip().split()
        last_ids[i] = int(item[1])
        last_items[i] = item

# Find an item with the smallest id
pos = np.argmin(last_ids)
current_item = last_items[pos]
# Inverting positions, so that id is first
current_item[0], current_item[1] = current_item[1], current_item[0]

while True:
    # Read next item from the corresponding file
    line = files[pos].readline()
    if line:
        item = line.strip().split()
        last_ids[pos] = int(item[1])
        last_items[pos] = item
    else:
        # EOF in files[pos], so delete it from the lists
        files[pos].close()
        del(files[pos])
        del(last_items[pos])
        last_ids = np.delete(last_ids, pos)
        if last_ids.size == 0:
            # No more files to read from
            break
    # Find an item with the smallest id
    pos = np.argmin(last_ids)
    if last_items[pos][1] == current_item[0]:
        # combine:
        current_item.append(last_items[pos][2])
    else:
        # write current to file and replace:
        output_file.write(' '.join(current_item) + '\n')
        current_item = last_items[pos]
        current_item[0], current_item[1] = current_item[1], current_item[0]

# The last item to write:
output_file.write(' '.join(current_item) + '\n')
output_file.close()
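As a side note, the same merge can be sketched with the standard library alone: `heapq.merge` lazily merges the already-sorted inputs by position, and `itertools.groupby` then combines lines that share a position. This is only a sketch under the same sorted-input assumption, not the code above; `records` and `merge_files` are names of my own choosing.

```python
import heapq
import io
from itertools import groupby

def records(f):
    # Yield (pos, chrom, cover) tuples from one 'chrom pos cover' file
    for line in f:
        chrom, pos, cover = line.split()
        yield int(pos), chrom, cover

def merge_files(files, out):
    # heapq.merge assumes each input is already sorted by position;
    # for equal positions it yields items from earlier files first
    merged = heapq.merge(*(records(f) for f in files), key=lambda r: r[0])
    for (pos, chrom), group in groupby(merged, key=lambda r: (r[0], r[1])):
        covers = [r[2] for r in group]
        out.write('\t'.join([str(pos), chrom] + covers) + '\n')

# Tiny in-memory demo with rows taken from the question:
f1 = io.StringIO("chrM 6423 5\nchrM 6432 4\n")
f2 = io.StringIO("chrM 6423 2\nchrM 7575 1\n")
out = io.StringIO()
merge_files([f1, f2], out)
print(out.getvalue())
# 6423  chrM  5  2
# 6432  chrM  4
# 7575  chrM  1   (tab-separated)
```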
If all files were small enough to fit into memory, then the following code is definitely shorter. Whether it's faster may depend on the data. (See comments below.)
import glob
import pandas as pd

path = './file[0-9]*'
filenames = glob.glob(path)

df_list = []
# Read in all files and concatenate to a single data frame:
for file in filenames:
    df_list.append(pd.read_csv(file, header=None, sep='\s+'))
df = pd.concat(df_list)

# changing type for convenience:
df[2] = df[2].astype(str)
# sorting here is not necessary:
# df = df.sort_values(by=1)
df2 = df.groupby(by=1).aggregate({0: 'first', 2: lambda x: ' '.join(x)})
df2.to_csv('output_file', header=None)
# (Columns in 'output_file' are separated by commas.)
# (Columns in 'output_file' are separated by commas. )
I tested both solutions on several input files with 1000-10000 lines. Usually the basic solution is faster (sometimes twice as fast as the other one). But it depends on the structure of the data. If there are many repeating 'id's, then pandas might be slightly more advantageous (by quite a small margin).
I think both approaches could be combined with pd.read_csv using the chunksize or iterator options. That way we could read in and operate on larger chunks of data (not single lines). But I'm not sure now whether it would lead to much faster code.
If that fails (and nobody finds a better way), you may consider running a map-reduce algorithm on Amazon Web Services. There is some work to do on setup at the beginning, but a map-reduce algorithm is very straightforward for this kind of problem.