How to retrieve strings from a large file

Question

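I wrote some code in which “IDs.txt” is a tab-delimited text file containing the IDs shown below: the first column is the ID, the second is the start index, and the third is the end index.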

IDs.txt-------

“complete.txt”

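The script I wrote below is meant to retrieve string fragments from “complete.txt” according to “IDs.txt”, but it is NOT working. What changes should I make to correct the code?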

with open("\Users\Zebrafish\Desktop\IDs.txt") as f: # will get input from the text
    for line in f:
        c = line.split("\t")                  
        for i, x in enumerate(c):                #passing values to  start and end variables 
            if i == 1:
               start = x
            elif i == 2:
                end =  x
            elif i == 0:
                 gene_name = x
        infile = open("/Users/Zebrafish/Desktop/complete.txt")  #file to get large string data 
        for seq in infile:
            seqnew = seq.split("\t")                       # get data as single line 
        retrived = seqnew[int(start):int(end)]             #retrieve  fragment 
        print retrived

Answer 1

I am not sure why you split the contents of your complete.txt file on \t. Here is an optimized version of your code:

ids = {}
with open('/Users/Zebrafish/Desktop/ASHISH/IDs.txt') as f:
    for line in f:
       if len(line.strip()):
           # This makes sure you skip blank lines
           id,start,end = line.split('\t')
           ids[id] = (int(start),int(end))

# Here, I assume your `complete.txt` is a file with one long line.
with open('/Users/Zebrafish/Desktop/ASHISH/complete.txt') as f:
    sequence = f.readline()

# For each id, fetch the sequence "chunk":
for id,value in ids.items():
    start, end = value
    print('{} {}'.format(id,sequence[start-1:end]))
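The `start-1` in the slice above presumably treats the coordinates in IDs.txt as 1-based and inclusive, which the question does not actually state. A minimal sketch of that conversion, using a made-up sequence and a hypothetical helper named `fragment_1based`:

```python
def fragment_1based(sequence, start, end):
    """Return the characters from start to end, both 1-based and inclusive.

    Assumption: this is the coordinate convention of IDs.txt; the question does not say.
    """
    # Python slices are 0-based and exclude the end index, so only the start shifts by one.
    return sequence[start - 1:end]


toy = "ACGTACGTAC"  # made-up sequence, not the contents of complete.txt
print(fragment_1based(toy, 1, 4))  # the first four characters
print(fragment_1based(toy, 3, 3))  # a single character
```

If the indices turn out to be 0-based with an exclusive end, the plain slice `sequence[start:end]` is the right form instead.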

Answer 2

3MB is not large (on any computer that can run Windows). Just load the second file into memory as a single string and take the fragments:

# populate `id -> (start, end)` map
ids = {} 
with open(r"\Users\Zebrafish\Desktop\ASHISH\IDs.txt") as id_file:
    for line in id_file:
        if line.strip(): # non-blank line
           id, start, end = line.split() 
           ids[id] = int(start), int(end)

# load the file as a single string (ignoring whitespace)
with open("/Users/Zebrafish/Desktop/ASHISH/complete.txt") as seq_file:
    s = "".join(seq_file.read().split()) # or re.sub("\s+", "", seq_file.read())

# print fragments
for id, (start, end) in ids.items():
    print("{id} -> {fragment}".format(id=id, fragment=s[start:end]))
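Stripping the whitespace matters when complete.txt is wrapped over several lines, because the newlines would otherwise count toward the indices. A small sketch with made-up text (the variable `raw` stands in for the file contents):

```python
raw = "ACGT\nACGT\nAC\n"       # made-up stand-in for a line-wrapped complete.txt
s = "".join(raw.split())       # drop every run of whitespace, including newlines

print(repr(raw[2:6]))          # slice of the raw text: includes a newline
print(repr(s[2:6]))            # slice of the cleaned text: sequence characters only
```

Whether the indices in IDs.txt refer to the cleaned sequence or to raw file offsets is an assumption that has to be checked against how the IDs were generated.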

If complete.txt does not fit in memory, you can use mmap to access its contents as a byte sequence without loading the entire file into memory:

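from mmap import ACCESS_READ, mmap

# open in binary mode: the map exposes the raw bytes of the file
with open("complete.txt", "rb") as f, mmap(f.fileno(), 0, access=ACCESS_READ) as s:
    # use `s` here (assume that indices refer to the raw file in this case)
    for id, (start, end) in ids.items():  # `ids` is the map populated above
        fragment = s[start:end]           # a bytes object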
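To try the mmap approach without the real data, here is a self-contained sketch that writes a made-up sequence to a temporary file and slices it through the map. The file contents and the offsets 4 and 10 are invented for illustration.

```python
import mmap
import os
import tempfile

# Write a small made-up sequence to a temporary file so the example is self-contained.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"ACGTACGTACGTACGT")
    path = tmp.name

try:
    # Binary mode: indices into the map are byte offsets into the file.
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        fragment = m[4:10]               # slicing the map returns a bytes object
        text = fragment.decode("ascii")  # decode if a str is needed
finally:
    os.remove(path)

print(text)
```

Note that the slice is taken on raw bytes, so any newlines in the file count toward the offsets.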

Answer 3

Delete the line:

seqnew = seq.split("\t")

Then simply slice the line itself:

retrieved = seq[int(start):int(end)]

This gives you the substring you want.

After that you will be able to:

print retrieved
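Putting that change into the question's script, a minimal sketch might look like the following. It assumes complete.txt holds the sequence on a single line and that the indices are used as plain Python slice bounds, as in the question. The function name `extract_fragments` is made up, and in-memory file objects stand in for the real files.

```python
import io


def extract_fragments(ids_file, seq_file):
    """Yield (gene_name, fragment) pairs for each row of the tab-separated ID file."""
    seq = seq_file.readline().rstrip("\n")  # read the sequence once, not once per ID
    for line in ids_file:
        if not line.strip():
            continue                        # skip blank lines
        gene_name, start, end = line.rstrip("\n").split("\t")
        yield gene_name, seq[int(start):int(end)]


# Made-up data standing in for IDs.txt and complete.txt.
ids_file = io.StringIO("geneA\t0\t4\ngeneB\t2\t6\n")
seq_file = io.StringIO("ACGTACGTAC\n")
results = list(extract_fragments(ids_file, seq_file))
print(results)
```

Reading the sequence before the loop also avoids reopening complete.txt for every ID, which the original script does.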

Answer 4

Beware of a leading \t in IDs.txt:

>>> print "\ta\tb\tc"
        a       b       c
>>> "\ta\tb\tc".split("\t")
['', 'a', 'b', 'c']

In that case i == 0 refers to the empty string rather than the gene ID.
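One way to guard against this, sketched under the assumption that none of the fields contain spaces, is to split on any whitespace instead of on "\t". Calling `str.split()` with no argument discards leading and trailing whitespace. The row below is made up.

```python
line = "\tgeneA\t10\t20\n"   # made-up row with a stray leading tab

print(line.split("\t"))       # the leading tab produces an empty first field
print(line.split())           # a whitespace split drops it, along with the newline

gene_name, start, end = line.split()
```

If IDs can contain spaces, stripping the line first with `line.strip().split("\t")` would be the safer variant.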


Answer 1: 1, accepted, 2014-02-06 07:06:14

Answer 2: 1, 2014-02-06 07:24:42

Answer 3: 0, 2014-02-06 07:05:56

Answer 4: 0, 2014-02-06 07:23:21
