How can I add all keys' values and print the new dictionary?
I have a file, file1, that holds region information, for example the region on chromosome 1 from position 1 to position 10. It looks like this:
chromosome,start_position,end_position
1,1,10
1,11,20
A second file, file2, holds a value for individual positions, for example position 6 on chromosome 1 with some value. It looks like this:
chromosome,position,value
1,1,value1
1,2,value2
1,6,value3
1,13,value4
I want to add the values in file2 to file1, based on whether each position in file2 belongs to a region in file1, giving something like:
chromosome,start_position,end_position,total_value
1,1,10,value1+value2+value3
1,11,20,value4
Both files can have more than 10 million lines. Should I loop over every line of file2 (checking whether its position falls in any region of file1), or turn every line of file1 into a dictionary entry (then look up the matching values in file2 and add them)?
And how can I get the total value for every line in file1? Thanks, everyone!
I'm presuming that you're not necessarily looking for the most efficient code, but for code that gets the job done.
I would read the values in file 2 into a dictionary, with the key being a (chromosome, position) pair (presuming that each line of file 2 describes a single position, so its start and end are always the same).
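One way the dictionary described above might be built is sketched below. This is only an illustration under assumptions that the post does not state: the values in file 2 are numeric, the file has a header row, there are no blank lines, and a position that appears twice should have its values added. The function name `load_position_values` and the sample numbers are made up for the example.

```python
import csv
import io

def load_position_values(lines):
    """Build {(chromosome, position): value} from file-2 style CSV lines."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row (assumed to be present)
    values = {}
    for chromosome, position, value in reader:
        key = (chromosome, position)  # both parts are kept as strings
        # Assumption: values are numeric; repeated positions are added up
        values[key] = values.get(key, 0.0) + float(value)
    return values

# Small in-memory stand-in for file 2 (made-up numbers)
sample_file2 = io.StringIO(
    "chromosome,position,value\n"
    "1,1,0.5\n"
    "1,2,1.5\n"
    "1,6,2.0\n"
    "1,13,4.0\n"
)
file2_dict = load_position_values(sample_file2)
print(file2_dict[("1", "6")])
```

For a real file, an open file object (for example from `open(path, newline="")`) could be passed in place of the `io.StringIO` stand-in.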
Then read file 1 line-by-line, and find all relevant values in your "file 2" dictionary, appending the resultant sum to the end of the line (probably in a new file):
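import numpy as np

# file1: open file object for the regions file, positioned past the header line
# file2_dict: maps (chromosome, position) string pairs to numeric values
for line in file1:
    chromosome, start, end = line.strip().split(',')
    # Sum the values of every position in the inclusive range [start, end]
    total_value = np.sum([file2_dict.get((chromosome, str(i)), 0)
                          for i in range(int(start), int(end) + 1)])
    # Do something with the total value, maybe write it to another file.
    # Could do:
    new_line = ','.join([chromosome, start, end, str(total_value)]) + '\n'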
I'm going to leave the rest of the implementation details to you (such as getting your dictionary from file 2). It might be a bit heavy on memory usage, but hopefully not too bad.
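If memory does become a problem, one alternative (not the dictionary approach described in this answer) is to keep only the regions in memory, stream the positions one at a time, and locate each position's region with a binary search. The sketch below assumes that regions on the same chromosome do not overlap, that positions and region bounds are integers, and that values are numeric. The function name `sum_by_region` and the sample data are hypothetical.

```python
import bisect
from collections import defaultdict

def sum_by_region(regions, positions):
    """Return {(chromosome, start, end): total value of the positions inside}.

    regions:   iterable of (chromosome, start, end); bounds are inclusive and
               regions on one chromosome are assumed not to overlap
    positions: iterable of (chromosome, position, value); can be a generator,
               so file 2 never has to be held in memory
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    for spans in by_chrom.values():
        spans.sort()
    # Sorted start positions per chromosome, used for the binary search
    starts = {chrom: [s for s, _ in spans] for chrom, spans in by_chrom.items()}
    totals = {(chrom, s, e): 0.0
              for chrom, spans in by_chrom.items() for s, e in spans}

    for chrom, pos, value in positions:
        if chrom not in starts:
            continue  # no regions on this chromosome
        # Index of the last region whose start is <= pos
        i = bisect.bisect_right(starts[chrom], pos) - 1
        if i >= 0:
            s, e = by_chrom[chrom][i]
            if pos <= e:
                totals[(chrom, s, e)] += value
    return totals

regions = [("1", 1, 10), ("1", 11, 20)]
positions = [("1", 1, 0.5), ("1", 2, 1.5), ("1", 6, 2.0),
             ("1", 13, 4.0), ("1", 25, 9.0)]
totals = sum_by_region(regions, positions)
print(totals)
```

Each position costs one binary search instead of a dictionary lookup for every position inside every region, which may matter when regions are wide. Whether it actually uses less memory depends on how large file1 is compared with file2.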
Note the use of the .get() method for the dictionary lookup: this makes sure that any key that isn't found in the dictionary returns 0. You decide if this works for your situation. Also note the use of str and int to convert between text and numbers. You decide if this is appropriate based on your implementation.
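A small illustration of both points, using a made-up one-entry dictionary: `.get()` falls back to the default when a key is missing, and a key built with an integer position does not match a key that was stored as text, which is why the conversions matter.

```python
# Keys are (str, str) pairs, as they would be when read straight from a text file
file2_dict = {("1", "6"): 2.0}

present = file2_dict.get(("1", "6"), 0)   # key exists: the stored value
missing = file2_dict.get(("1", "7"), 0)   # key absent: the default, 0
wrong_type = file2_dict.get(("1", 6), 0)  # int 6 is not the string "6": default, 0
print(present, missing, wrong_type)

# start and end arrive as text, so convert to int for range()
# and back to str to build keys that match the dictionary
start, end = "1", "10"
keys = [("1", str(i)) for i in range(int(start), int(end) + 1)]
print(len(keys))
```

If the dictionary were built with integer positions instead, the `str(i)` conversion would have to be dropped so that the keys still match.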
Also, if you haven't come across Python list comprehensions before, do some research on them. That is what allows us to write the one-liner that sums all the relevant values.
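For comparison, here is the same sum written as an explicit loop and as a list comprehension. The dictionary contents are made up, and the built-in `sum` is used here in place of `np.sum` so that the snippet needs no imports.

```python
values = {("1", "1"): 0.5, ("1", "2"): 1.5, ("1", "6"): 2.0}
chromosome, start, end = "1", "1", "10"

# Explicit loop: add up the value of each position in the region
total_loop = 0
for i in range(int(start), int(end) + 1):
    total_loop += values.get((chromosome, str(i)), 0)

# The same computation as a one-line list comprehension
total_comp = sum([values.get((chromosome, str(i)), 0)
                  for i in range(int(start), int(end) + 1)])

print(total_loop, total_comp)
```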