How to retrieve a string from a large file
I have written a script in which "IDs.txt" is a tab-delimited text file containing IDs in the manner given below: the first column is the ID, the second column is the starting index, and the third column is the ending index.
IDs.txt-------
"complete.txt"
The script I have written, given below, is meant to retrieve the string fragments according to "IDs.txt", but it is NOT working. Please help: what changes should I make to correct the code?
with open("\Users\Zebrafish\Desktop\IDs.txt") as f: # will get input from the text
    for line in f:
        c = line.split("\t")
        for i, x in enumerate(c): #passing values to start and end variables
            if i == 1:
                start = x
            elif i == 2:
                end = x
            elif i == 0:
                gene_name = x
        infile = open("/Users/Zebrafish/Desktop/complete.txt") #file to get large string data
        for seq in infile:
            seqnew = seq.split("\t") # get data as single line
            retrived = seqnew[int(start):int(end)] #retrieve fragment
            print retrived
I don't know why you are splitting on \t in your complete.txt file. Here is your code, optimized:
ids = {}
with open('/Users/Zebrafish/Desktop/ASHISH/IDs.txt') as f:
    for line in f:
        if len(line.strip()):
            # This makes sure you skip blank lines
            id,start,end = line.split('\t')
            ids[id] = (int(start),int(end))

# Here, I assume your `complete.txt` is a file with one long line.
with open('/Users/Zebrafish/Desktop/ASHISH/complete.txt') as f:
    sequence = f.readline()

# For each id, fetch the sequence "chunk":
for id,value in ids.iteritems():
    start, end = value
    print('{} {}'.format(id,sequence[start-1:end]))
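The slice `sequence[start-1:end]` above treats the coordinates in IDs.txt as 1-based and inclusive. The question does not say which convention the file uses, so that is an assumption. A minimal sketch of the difference, written for Python 3 with made-up data and two hypothetical helpers (`fragment_1based`, `fragment_0based`):

```python
def fragment_1based(sequence, start, end):
    """Return the fragment, treating start/end as 1-based and inclusive."""
    return sequence[start - 1:end]


def fragment_0based(sequence, start, end):
    """Return the fragment Python-style: 0-based start, exclusive end."""
    return sequence[start:end]


sequence = "ABCDEFGHIJ"  # made-up data, for illustration only
print(fragment_1based(sequence, 2, 5))  # BCDE
print(fragment_0based(sequence, 2, 5))  # CDE
```

Checking a fragment or two by hand against the real file is probably the quickest way to tell which convention applies.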
3 MB is not huge (on a computer that can run Windows). Just load the second file into memory as a single string to get the fragments:
# populate `id -> (start, end)` map
ids = {}
with open(r"\Users\Zebrafish\Desktop\ASHISH\IDs.txt") as id_file:
    for line in id_file:
        if line.strip(): # non-blank line
            id, start, end = line.split()
            ids[id] = int(start), int(end)

# load the file as a single string (ignoring whitespace)
with open("/Users/Zebrafish/Desktop/ASHISH/complete.txt") as seq_file:
    s = "".join(seq_file.read().split()) # or re.sub("\s+", "", seq_file.read())

# print fragments
for id, (start, end) in ids.items():
    print("{id} -> {fragment}".format(id=id, fragment=s[start:end]))
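To check this logic without touching the real files, the same steps can be run on in-memory stand-ins. This is only a sketch: the sample IDs, coordinates, and sequence are made up, and `parse_ids` and `strip_whitespace` are hypothetical helper names, not part of the code above.

```python
import io


def parse_ids(lines):
    """Build an id -> (start, end) map from whitespace-separated lines."""
    ids = {}
    for line in lines:
        if line.strip():  # skip blank lines
            gene_id, start, end = line.split()
            ids[gene_id] = (int(start), int(end))
    return ids


def strip_whitespace(text):
    """Join the text into one string with all whitespace removed."""
    return "".join(text.split())


# Made-up stand-ins for IDs.txt and complete.txt
id_file = io.StringIO("geneA\t0\t4\n\ngeneB\t2\t6\n")
seq_text = "ACGT\nTTGA\n"

ids = parse_ids(id_file)
s = strip_whitespace(seq_text)
for gene_id, (start, end) in sorted(ids.items()):
    print("{} -> {}".format(gene_id, s[start:end]))
# geneA -> ACGT
# geneB -> GTTT
```

Note that removing the newlines before slicing means the indices refer to positions in the sequence itself, not to byte offsets in the file.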
If the complete.txt file doesn't fit in memory, you could use mmap to access its content as a sequence of bytes without loading the whole file into memory:
from mmap import ACCESS_READ, mmap

with open("complete.txt") as f, mmap(f.fileno(), 0, access=ACCESS_READ) as s:
    # use `s` here (assume that indices refer to the raw file in this case)
    fragment = s[start:end] # start, end: indices taken from the ids map
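A self-contained sketch of the mmap approach, written for Python 3 and using a temporary file as a made-up stand-in for complete.txt. The helper name `read_fragment` is hypothetical. It closes the map explicitly rather than relying on `mmap` being usable in a `with` statement.

```python
import mmap
import os
import tempfile


def read_fragment(path, start, end):
    """Return bytes [start, end) of the file without reading all of it."""
    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return mapped[start:end]  # slicing a mmap returns bytes
        finally:
            mapped.close()


# Made-up stand-in for complete.txt
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as tmp:
    tmp.write(b"ACGTTTGA")

fragment = read_fragment(path, 2, 6)
print(fragment.decode("ascii"))  # GTTT
os.remove(path)
```

Because the indices are raw byte offsets, any newline characters in the file count towards the positions, unlike the in-memory version that strips whitespace first.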
Remove the line:
seqnew = seq.split("\t")
Then slicing the line directly:
retrieved = seq[int(start):int(end)]
will get the substring you want.
Then you'll be able to:
print retrieved
Beware of the leading \t in IDs.txt:
>>> print "\ta\tb\tc"
a b c
>>> "\ta\tb\tc".split("\t")
['', 'a', 'b', 'c']
With a leading tab, i == 0 refers to an empty string rather than the gene ID.
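One way to sidestep the empty first field is to call `split()` with no argument, which splits on runs of whitespace and ignores leading and trailing whitespace. A minimal sketch, assuming the IDs themselves contain no spaces or tabs (`parse_id_line` is a hypothetical helper name and the sample line is made up):

```python
def parse_id_line(line):
    """Split one IDs.txt line into (gene_name, start, end).

    split() without an argument drops leading/trailing whitespace,
    so a leading tab does not produce an empty first field.
    """
    gene_name, start, end = line.split()
    return gene_name, int(start), int(end)


print("\tgeneA\t10\t20\n".split("\t"))     # ['', 'geneA', '10', '20\n']
print(parse_id_line("\tgeneA\t10\t20\n"))  # ('geneA', 10, 20)
```

If a line does not have exactly three fields, the tuple unpacking raises a ValueError, which makes malformed lines easy to spot.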