How do I write only lines of a specific length to a file?

Question

I have sequences like these (more than 9000 of them):

>TsM_000224500 
MTTKWPQTTVTVATLSWGMLRLSMPKVQTTYKVTQSRGPLLAPGICDSWSRCLVLRVYVDRRRPGGDGSLGRVAVTVVETGCFGSAASFSMWVFGLAFVVTIEEQLL
>TsM_000534500 
MHSHIVTVFVALLLTTAVVYAHIGMHGEGCTTLQCQRHAFMMKEREKLNEMQLELMEMLMDIQTMNEQEAYYAGLHGAGMQQPLPMPIQ
>TsM_000355900 
MESGEENEYPMSCNIEEEEDIKFEPENGKVAEHESGEKKESIFVKHDDAKWVGIGFAIGTAVAPAVLSGISSAAVQGIRQPIQAGRNNGETTEDLENLINSVEDDL

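Lines containing ">" are IDs, and lines of letters are amino acid (aa) sequences. I need to delete (or move to another file) the sequences shorter than 40 aa or longer than 4000 aa. The resulting file should then contain only sequences within that range (>= 40 aa and <= 4000 aa).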

I tried writing the following script:

def read_seq(file_name):
    with open(file_name) as file:
        return file.read().split('\n')[0:]

ts = read_seq("/home/tiago/t_solium/ts_phtm0less.txt")

tsf = open("/home/tiago/t_solium/ts_secp-404k", 'w')

for x in range(len(ts)):
    if ([x][0:1] != '>'):
        if (len([x]) > 40 or len([x]) < 4000):

            tsf.write('%s\n'%(x))

tsf.close()

print "OK!"

I made some changes, but all I get is either an empty file or all 9000+ sequences.

Answer 1

In your for loop, x is an integer counter because you iterate over range() (i.e. 0, 1, 2, 3, 4...). Try this instead:

for x in ts:

This gives you each element x of ts.

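Also, you don't need brackets around x; Python can index the characters of a string by itself. Putting brackets around a string places it in a list, so if you try to get the second character of x with [x][1], Python instead looks for the second element of the list you put x into and runs into problems (an IndexError, since that list has only one element).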
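A small illustration of the point about brackets may help. The sample string below is hypothetical, not taken from the input file; the sketch only shows how a string and a one-element list wrapping it behave differently under len() and slicing.

```python
x = "MTTKWPQTTV"      # hypothetical sequence line
wrapped = [x]         # a list that holds exactly one string

print(len(x))         # number of characters in the string
print(len(wrapped))   # number of elements in the list, which is 1

print(x[0:1])         # slice of the string: its first character
print(wrapped[0:1])   # slice of the list: still a list, so it never equals '>'

# Both tests from the question therefore give the same answer for every line:
print(wrapped[0:1] != '>')                       # True for any x
print(len(wrapped) > 40 or len(wrapped) < 4000)  # True for any x
```

Because both conditions hold no matter what x is, the original loop cannot tell IDs from sequences or short lines from long ones.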

Edit: to include the IDs, try the following:

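Note: I also changed if (len(x) > 40 or len(x) < 4000) to if (len(x) >= 40 and len(x) <= 4000). Every length satisfies the or version, so it filters nothing; using and gives you the result you are looking for, and >= and <= match the inclusive range you asked for. The extra check on x skips the empty string that split('\n') leaves at the end of the list, which would otherwise make x[0] fail.

for i, x in enumerate(ts):  # NEW: enumerate ts to get the index of every iteration (stored as i)
    if x and x[0] != '>':  # skip empty lines and ID lines
        if len(x) >= 40 and len(x) <= 4000:
            tsf.write('%s\n' % ts[i-1])  # NEW: write the ID found on the preceding line
            tsf.write('%s\n' % x)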
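To see why the choice between or and and matters, the following sketch runs both conditions over a handful of made-up lengths. The function names and the sample lengths are illustrative assumptions; the and version uses the inclusive bounds stated in the question (>= 40 and <= 4000).

```python
def passes_or(n):
    # Any integer is either greater than 40 or less than 4000, so this never rejects.
    return n > 40 or n < 4000

def passes_and(n):
    # Both bounds must hold at once: 40 <= n <= 4000.
    return n >= 40 and n <= 4000

lengths = [10, 39, 40, 89, 4000, 4001, 9000]  # hypothetical sequence lengths

print([n for n in lengths if passes_or(n)])   # [10, 39, 40, 89, 4000, 4001, 9000]
print([n for n in lengths if passes_and(n)])  # [40, 89, 4000]
```

Note that the loop above pairs each sequence with ts[i-1], which assumes every sequence sits on a single line directly after its ID, as in the sample data.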

Answer 2

Try this; it is simple and easy to follow. It does not load the whole file into memory but iterates over the file line by line.

tsf = open('output.txt', 'w')  # open the output file
with open("yourfile", 'r') as ts:  # open the input file
    for line in ts:  # iterate over each line of the input file
        line = line.strip()  # remove whitespace at the start and end, including spaces, tabs, newlines and carriage returns
        if not line or line[0] == '>':  # blank line or ID line
            continue  # move to the next line
        if 40 <= len(line) <= 4000:  # sequence length is within the required range
            tsf.write('%s\n' % line)  # write to the output file

tsf.close()  # done
print("OK!")
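The loop above writes only the sequence lines. If the IDs should be kept and the out-of-range sequences moved to a second file, as the question mentions, one possible sketch (Python 3) is shown below. The helper names read_records and split_by_length are hypothetical, and the in-memory io.StringIO objects stand in for real files; sequence lines that wrap across several lines are joined before measuring, which is an assumption beyond the sample data.

```python
import io

def read_records(handle):
    """Yield (header, sequence) pairs, joining sequence lines that wrap."""
    header, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith('>'):
            if header is not None:
                yield header, ''.join(parts)
            header, parts = line, []      # start a new record
        elif header is not None:
            parts.append(line)            # sequence text for the current ID
    if header is not None:
        yield header, ''.join(parts)      # emit the last record

def split_by_length(src, kept, rejected, min_len=40, max_len=4000):
    """Write in-range records to kept and the rest to rejected; return both counts."""
    n_kept = n_rejected = 0
    for header, seq in read_records(src):
        if min_len <= len(seq) <= max_len:
            out = kept
            n_kept += 1
        else:
            out = rejected
            n_rejected += 1
        out.write('%s\n%s\n' % (header, seq))
    return n_kept, n_rejected

# Tiny made-up input: one 50 aa record and one 10 aa record.
sample = io.StringIO(">a\n" + "M" * 50 + "\n>b\n" + "M" * 10 + "\n")
kept, rejected = io.StringIO(), io.StringIO()
print(split_by_length(sample, kept, rejected))  # (1, 1)
```

With real data, the three arguments would be file objects opened with open(), reading the input and writing the two outputs; the file is still processed one line at a time.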

FYI, if you are working in a Unix environment, you can also use awk for a one-line solution:

cat yourinputfile.txt | grep -v '>' | awk 'length($0)>=40' | awk 'length($0)<=4000' > youroutputfile.txt


Answer 1
1, accepted, 2016-06-24 14:02:56

Answer 2
0, 2016-06-24 14:05:19
