Parsing string into list of outputs
I am working with a text file that contains many similar reports with the following structure:
['NetNGlyc-1.0 Server Output - DTU Health Tech\n',
' Asn-Xaa-Ser/Thr sequons in the sequence output below are highlighted in blue.\n',
' Asparagines predicted to be N-glycosylated are highlighted in red.\n',
"Output for 'Sequence'\n",
'Name: Sequence Length: 923\n',
'MERGLPLLCAVLALVLAPAGAFRNDKCGDTIKIESPGYLTSPGYPHSYHPSEKCEWLIQAPDPYQRIMINFNPHFDLEDR 80 \n',
'DCKYDYVEVFDGENENGHFRGKFCGKIAPPPVVSSGPFLFIKFVSDYETHGAGFSIRYEIFKRGPECSQNYTTPSGVIKS 160 \n',
'PGFPEKYPNSLECTYIVFVPKMSEIILEFESFDLEPDSNPPGGMFCRYDRLEIWDGFPDVGPHIGRYCGQKTPGRIRSSS 240 \n',
'GILSMVFYTDSAIAKEGFSANYSVLQSSVSEDFKCMEALGMESGEIHSDQITASSQYSTNWSAERSRLNYPENGWTPGED 320 \n',
'SYREWIQVDLGLLRFVTAVGTQGAISKETKKKYYVKTYKIDVSSNGEDWITIKEGNKPVLFQGNTNPTDVVVAVFPKPLI 400 \n',
'TRFVRIKPATWETGISMRFEVYGCKITDYPCSGMLGMVSGLISDSQITSSNQGDRNWMPENIRLVTSRSGWALPPAPHSY 480 \n',
'INEWLQIDLGEEKIVRGIIIQGGKHRENKVFMRKFKIGYSNNGSDWKMIMDDSKRKAKSFEGNNNYDTPELRTFPALSTR 560 \n',
'FIRIYPERATHGGLGLRMELLGCEVEAPTAGPTTPNGNLVDECDDDQANCHSGTGDDFQLTGGTTVLATEKPTVIDSTIQ 640 \n',
'SEFPTYGFNCEFGWGSHKTFCHWEHDNHVQLKWSVLTSKTGPIQDHTGDGNFIYSQADENQKGKVARLVSPVVYSQNSAH 720 \n',
'CMTFWYHMSGSHVGTLRVKLRYQKPEEYDQLVWMAIGHQGDHWKEGRVLLHKSLKLYQVIFEGEIGKGNLGGIAVDDISI 800 \n',
'NNHISQEDCAKPADLDKKNPEIKIDETGSTPGYEGEGEGDKNISRKPGNVLKTLDPILITIIAMSALGVLLGAVCGVVLY 880 \n',
'CACWHNGMSERNLSALENYNFELVDGVKLKKDKLNTQSTYSEA\n',
'................................................................................ 80\n',
'.....................................................................N.......... 160\n',
'................................................................................ 240\n',
'....................N........................................................... 320\n',
'.................................................................N.............. 400\n',
'................................................................................ 480\n',
'................................................................................ 560\n',
'................................................................................ 640\n',
'................................................................................ 720\n',
'................................................................................ 800\n',
'................................................................................ 880\n',
'........................................... 960\n',
'\n',
'(Threshold=0.5)\n',
'----------------------------------------------------------------------\n',
'SeqName Position Potential Jury N-Glyc\n',
' agreement result\n',
'----------------------------------------------------------------------\n',
'Sequence 150 NYTT 0.5361 (5/9) + \n',
'Sequence 261 NYSV 0.5599 (6/9) + \n',
'Sequence 300 NWSA 0.4157 (6/9) - \n',
'Sequence 386 NPTD 0.7736 (9/9) +++ WARNING: PRO-X1. \n',
'Sequence 522 NGSD 0.3918 (9/9) -- \n',
'Sequence 842 NISR 0.4662 (6/9) - \n',
'Sequence 892 NLSA 0.4099 (6/9) - \n',
'----------------------------------------------------------------------\n',
'\n',
'\n',
'Graphics in PostScript\n',
'\n',
'\n',
'Go back.\n']
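The result I am trying to get is a list of elements in which each element is a string that contains only the information I want to keep. The final list structure I am aiming for looks like this: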
['Sequence 150 NYTT 0.5361 (5/9) + \n
Sequence 261 NYSV 0.5599 (6/9) + \n
Sequence 300 NWSA 0.4157 (6/9) - \n',
'Sequence 150 NYTT 0.5361 (5/9) + \n
Sequence 261 NYSV 0.5599 (6/9) + \n
Sequence 300 NWSA 0.4157 (6/9) - \n
Sequence 466 NYSV 0.6178 (7/9) + \n
Sequence 300 NWSA 0.4157 (6/9) - \n',
'Sequence 150 NYTT 0.5361 (5/9) + \n
Sequence 261 NYSV 0.5599 (6/9) + \n
Sequence 300 NWSA 0.4157 (6/9) - \n',
...]
I managed to separate the outputs with the following code:
import re

with open('/path_to_text_file/file.txt', 'r') as file:
    test_output = file.readlines()
test_string = ''.join(map(str, test_output))  # convert the list into a string
# here I decided to split the outputs by the 'Go back' substring
# 1. first split by "\n\n" preceding the 'Go back' substring
# 2. then by ".\n" following the 'Go back'
# 3. then by "\n" left
test_string_split = ' '.join(map(str, ' '.join(map(str, test_string.split('\n\n'))).split('.\n')))
# split into elements by the 'Go back' substring
processed_test = ''.join(test_string_split).split('Go back')
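Now I have a list of elements in which each element contains one output. However, I have not managed to strip all the unnecessary text from each output while keeping the list structure, where each element comes from one report. I tried the following logic: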
res = []  # create a list for the final result
# split each output in the text file by '\n'
for output in processed_test:
    output_split = ''.join(output).split('\n')
    # then check each line of the output for the 'Sequence' substring
    for string in output_split:
        string_el = ''.join(string)
        if re.match("Sequence.*", string_el):
            res.append(string_el)  # save matches to the resulting list
But what I get is a flat list in which each element holds a single 'Sequence' line:
['Sequence 522 NGSD 0.3918 (9/9) -- ',
'Sequence 842 NISR 0.4662 (6/9) - ',
'Sequence 892 NLSA 0.4099 (6/9) - ',
'Sequence 63 NYTV 0.7796 (9/9) +++ ',
'Sequence 209 NITL 0.7032 (8/9) + ',
'Sequence 297 NVSI 0.6256 (8/9) + ',
'Sequence 365 NLSQ 0.6403 (7/9) + ',
'Sequence 522 NTSH 0.5207 (6/9) + ',
'Sequence 696 NCSI 0.6619 (9/9) ++ ',
...
...
...]
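Is there a way to parse the lists inside the elements themselves so that the list structure is preserved? The idea is that I need to know which report each piece of sequence information comes from.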
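Try this, where `lines` is your current output (the flat list of 'Sequence' lines). It splits your list into 3 parts.
import numpy as np

lines = []  # replace with your current flat list of 'Sequence' lines
output = []
splitted = np.array_split(lines, 3)  # three nearly equal chunks
for listt in splitted:
    output.append("\n".join(listt))
print(output)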
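Note that `np.array_split` cuts the list into three nearly equal chunks regardless of where each report ends, so it only fits when the number of reports is known and every report has the same number of 'Sequence' lines. A minimal sketch that instead builds one element per report from the question's `processed_test` list could look like this (the function name `group_sequence_lines` and the sample data are made up for illustration):

```python
def group_sequence_lines(reports):
    """Return one string per report, holding only its 'Sequence' lines."""
    grouped = []
    for report in reports:
        # Keep only the result-table rows of this report
        rows = [line for line in report.split('\n') if line.startswith('Sequence')]
        if rows:  # skip chunks without any result rows (e.g. trailing text)
            grouped.append('\n'.join(rows) + '\n')
    return grouped


# Hypothetical stand-in for the question's `processed_test`
sample_reports = [
    "Name: Sequence Length: 923\n"
    "Sequence 150 NYTT 0.5361 (5/9) + \n"
    "Sequence 261 NYSV 0.5599 (6/9) + \n",
    "header\nSequence 63 NYTV 0.7796 (9/9) +++ \n",
    ".\n",
]
grouped = group_sequence_lines(sample_reports)
print(len(grouped))
```

Because the rows are collected per report before being joined, the position of each string in `grouped` still tells you which report it came from.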
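IIUC, you want to keep only the 'Sequence' lines while recording which report each one came from.
This can be done as follows:
Code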
import ast
import os

import pandas as pd


def make_reports(file_path):
    with open(file_path, 'r') as f:
        # Convert the file content (a Python list literal) into a list
        lines = ast.literal_eval(f.read())

    stack = [[]]  # start with an empty first report
    for line in lines:
        if line.startswith('Sequence'):
            # Append the Sequence line to the current report
            stack[-1].append(line)
        elif line.startswith('Go back'):
            stack.append([])  # start a new report

    # Convert to a dataframe, with each report enumerated (i.e. 0, 1, 2, ...)
    dfs = []
    for i, seqs in enumerate(stack):
        if seqs:
            # Two-column dataframe: Sequence and Report number
            dfs.append(pd.DataFrame({'Sequences': seqs, 'Report': [i] * len(seqs)}))
    result = pd.concat(dfs, ignore_index=True, sort=False)

    # Write to a results file (input file path with -result appended to the name)
    result.to_csv(f'{os.path.splitext(file_path)[0]}-result.txt',
                  encoding='utf-8',
                  index=False)
    return result
Usage
make_reports('test.txt')
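`make_reports` assumes that the file holds a Python list literal, because it parses the content with `ast.literal_eval`. If the file is instead the raw NetNGlyc text, the same grouping could be done line by line without pandas. This is only a sketch; the name `group_by_report` and the inline sample text are assumptions for illustration:

```python
import io


def group_by_report(lines):
    """Group 'Sequence' lines into one list per report.

    `lines` is any iterable of text lines, e.g. an open file object.
    A line starting with 'Go back' marks the end of a report.
    """
    reports = [[]]
    for line in lines:
        if line.startswith('Sequence'):
            reports[-1].append(line.rstrip('\n'))
        elif line.startswith('Go back'):
            reports.append([])
    # Drop empty groups, such as the one opened after the last 'Go back.'
    return [group for group in reports if group]


# Hypothetical two-report sample standing in for a real file
sample = io.StringIO(
    "Name: Sequence Length: 923\n"
    "Sequence 150 NYTT 0.5361 (5/9) + \n"
    "Sequence 261 NYSV 0.5599 (6/9) + \n"
    "Go back.\n"
    "Sequence 63 NYTV 0.7796 (9/9) +++ \n"
    "Go back.\n"
)
by_report = group_by_report(sample)
for index, rows in enumerate(by_report):
    print(index, len(rows))
```

Dropping empty groups renumbers the reports, so keep them if a report without any predicted site must retain its index.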
Input file: test.txt
Created by copying the posted data two more times, so that the file contains three reports
['NetNGlyc-1.0 Server Output - DTU Health Tech\n',
' Asn-Xaa-Ser/Thr sequons in the sequence output below are highlighted in blue.\n',
' Asparagines predicted to be N-glycosylated are highlighted in red.\n',
"Output for 'Sequence'\n",
'Name: Sequence Length: 923\n',
'MERGLPLLCAVLALVLAPAGAFRNDKCGDTIKIESPGYLTSPGYPHSYHPSEKCEWLIQAPDPYQRIMINFNPHFDLEDR 80 \n',
'DCKYDYVEVFDGENENGHFRGKFCGKIAPPPVVSSGPFLFIKFVSDYETHGAGFSIRYEIFKRGPECSQNYTTPSGVIKS 160 \n',
'PGFPEKYPNSLECTYIVFVPKMSEIILEFESFDLEPDSNPPGGMFCRYDRLEIWDGFPDVGPHIGRYCGQKTPGRIRSSS 240 \n',
'GILSMVFYTDSAIAKEGFSANYSVLQSSVSEDFKCMEALGMESGEIHSDQITASSQYSTNWSAERSRLNYPENGWTPGED 320 \n',
'SYREWIQVDLGLLRFVTAVGTQGAISKETKKKYYVKTYKIDVSSNGEDWITIKEGNKPVLFQGNTNPTDVVVAVFPKPLI 400 \n',
'TRFVRIKPATWETGISMRFEVYGCKITDYPCSGMLGMVSGLISDSQITSSNQGDRNWMPENIRLVTSRSGWALPPAPHSY 480 \n',
'INEWLQIDLGEEKIVRGIIIQGGKHRENKVFMRKFKIGYSNNGSDWKMIMDDSKRKAKSFEGNNNYDTPELRTFPALSTR 560 \n',
'FIRIYPERATHGGLGLRMELLGCEVEAPTAGPTTPNGNLVDECDDDQANCHSGTGDDFQLTGGTTVLATEKPTVIDSTIQ 640 \n',
'SEFPTYGFNCEFGWGSHKTFCHWEHDNHVQLKWSVLTSKTGPIQDHTGDGNFIYSQADENQKGKVARLVSPVVYSQNSAH 720 \n',
'CMTFWYHMSGSHVGTLRVKLRYQKPEEYDQLVWMAIGHQGDHWKEGRVLLHKSLKLYQVIFEGEIGKGNLGGIAVDDISI 800 \n',
'NNHISQEDCAKPADLDKKNPEIKIDETGSTPGYEGEGEGDKNISRKPGNVLKTLDPILITIIAMSALGVLLGAVCGVVLY 880 \n',
'CACWHNGMSERNLSALENYNFELVDGVKLKKDKLNTQSTYSEA\n',
'................................................................................ 80\n',
'.....................................................................N.......... 160\n',
'................................................................................ 240\n',
'....................N........................................................... 320\n',
'.................................................................N.............. 400\n',
'................................................................................ 480\n',
'................................................................................ 560\n',
'................................................................................ 640\n',
'................................................................................ 720\n',
'................................................................................ 800\n',
'................................................................................ 880\n',
'........................................... 960\n',
'\n',
'(Threshold=0.5)\n',
'----------------------------------------------------------------------\n',
'SeqName Position Potential Jury N-Glyc\n',
' agreement result\n',
'----------------------------------------------------------------------\n',
'Sequence 150 NYTT 0.5361 (5/9) + \n',
'Sequence 261 NYSV 0.5599 (6/9) + \n',
'Sequence 300 NWSA 0.4157 (6/9) - \n',
'Sequence 386 NPTD 0.7736 (9/9) +++ WARNING: PRO-X1. \n',
'Sequence 522 NGSD 0.3918 (9/9) -- \n',
'Sequence 842 NISR 0.4662 (6/9) - \n',
'Sequence 892 NLSA 0.4099 (6/9) - \n',
'----------------------------------------------------------------------\n',
'\n',
'\n',
'Graphics in PostScript\n',
'\n',
'\n',
'Go back.\n',
'NetNGlyc-1.0 Server Output - DTU Health Tech\n',
' Asn-Xaa-Ser/Thr sequons in the sequence output below are highlighted in blue.\n',
' Asparagines predicted to be N-glycosylated are highlighted in red.\n',
"Output for 'Sequence'\n",
'Name: Sequence Length: 923\n',
'MERGLPLLCAVLALVLAPAGAFRNDKCGDTIKIESPGYLTSPGYPHSYHPSEKCEWLIQAPDPYQRIMINFNPHFDLEDR 80 \n',
'DCKYDYVEVFDGENENGHFRGKFCGKIAPPPVVSSGPFLFIKFVSDYETHGAGFSIRYEIFKRGPECSQNYTTPSGVIKS 160 \n',
'PGFPEKYPNSLECTYIVFVPKMSEIILEFESFDLEPDSNPPGGMFCRYDRLEIWDGFPDVGPHIGRYCGQKTPGRIRSSS 240 \n',
'GILSMVFYTDSAIAKEGFSANYSVLQSSVSEDFKCMEALGMESGEIHSDQITASSQYSTNWSAERSRLNYPENGWTPGED 320 \n',
'SYREWIQVDLGLLRFVTAVGTQGAISKETKKKYYVKTYKIDVSSNGEDWITIKEGNKPVLFQGNTNPTDVVVAVFPKPLI 400 \n',
'TRFVRIKPATWETGISMRFEVYGCKITDYPCSGMLGMVSGLISDSQITSSNQGDRNWMPENIRLVTSRSGWALPPAPHSY 480 \n',
'INEWLQIDLGEEKIVRGIIIQGGKHRENKVFMRKFKIGYSNNGSDWKMIMDDSKRKAKSFEGNNNYDTPELRTFPALSTR 560 \n',
'FIRIYPERATHGGLGLRMELLGCEVEAPTAGPTTPNGNLVDECDDDQANCHSGTGDDFQLTGGTTVLATEKPTVIDSTIQ 640 \n',
'SEFPTYGFNCEFGWGSHKTFCHWEHDNHVQLKWSVLTSKTGPIQDHTGDGNFIYSQADENQKGKVARLVSPVVYSQNSAH 720 \n',
'CMTFWYHMSGSHVGTLRVKLRYQKPEEYDQLVWMAIGHQGDHWKEGRVLLHKSLKLYQVIFEGEIGKGNLGGIAVDDISI 800 \n',
'NNHISQEDCAKPADLDKKNPEIKIDETGSTPGYEGEGEGDKNISRKPGNVLKTLDPILITIIAMSALGVLLGAVCGVVLY 880 \n',
'CACWHNGMSERNLSALENYNFELVDGVKLKKDKLNTQSTYSEA\n',
'................................................................................ 80\n',
'.....................................................................N.......... 160\n',
'................................................................................ 240\n',
'....................N........................................................... 320\n',
'.................................................................N.............. 400\n',
'................................................................................ 480\n',
'................................................................................ 560\n',
'................................................................................ 640\n',
'................................................................................ 720\n',
'................................................................................ 800\n',
'................................................................................ 880\n',
'........................................... 960\n',
'\n',
'(Threshold=0.5)\n',
'----------------------------------------------------------------------\n',
'SeqName Position Potential Jury N-Glyc\n',
' agreement result\n',
'----------------------------------------------------------------------\n',
'Sequence 150 NYTT 0.5361 (5/9) + \n',
'Sequence 261 NYSV 0.5599 (6/9) + \n',
'Sequence 300 NWSA 0.4157 (6/9) - \n',
'Sequence 386 NPTD 0.7736 (9/9) +++ WARNING: PRO-X1. \n',
'Sequence 522 NGSD 0.3918 (9/9) -- \n',
'Sequence 842 NISR 0.4662 (6/9) - \n',
'Sequence 892 NLSA 0.4099 (6/9) - \n',
'----------------------------------------------------------------------\n',
'\n',
'\n',
'Graphics in PostScript\n',
'\n',
'\n',
'Go back.\n',
'NetNGlyc-1.0 Server Output - DTU Health Tech\n',
' Asn-Xaa-Ser/Thr sequons in the sequence output below are highlighted in blue.\n',
' Asparagines predicted to be N-glycosylated are highlighted in red.\n',
"Output for 'Sequence'\n",
'Name: Sequence Length: 923\n',
'MERGLPLLCAVLALVLAPAGAFRNDKCGDTIKIESPGYLTSPGYPHSYHPSEKCEWLIQAPDPYQRIMINFNPHFDLEDR 80 \n',
'DCKYDYVEVFDGENENGHFRGKFCGKIAPPPVVSSGPFLFIKFVSDYETHGAGFSIRYEIFKRGPECSQNYTTPSGVIKS 160 \n',
'PGFPEKYPNSLECTYIVFVPKMSEIILEFESFDLEPDSNPPGGMFCRYDRLEIWDGFPDVGPHIGRYCGQKTPGRIRSSS 240 \n',
'GILSMVFYTDSAIAKEGFSANYSVLQSSVSEDFKCMEALGMESGEIHSDQITASSQYSTNWSAERSRLNYPENGWTPGED 320 \n',
'SYREWIQVDLGLLRFVTAVGTQGAISKETKKKYYVKTYKIDVSSNGEDWITIKEGNKPVLFQGNTNPTDVVVAVFPKPLI 400 \n',
'TRFVRIKPATWETGISMRFEVYGCKITDYPCSGMLGMVSGLISDSQITSSNQGDRNWMPENIRLVTSRSGWALPPAPHSY 480 \n',
'INEWLQIDLGEEKIVRGIIIQGGKHRENKVFMRKFKIGYSNNGSDWKMIMDDSKRKAKSFEGNNNYDTPELRTFPALSTR 560 \n',
'FIRIYPERATHGGLGLRMELLGCEVEAPTAGPTTPNGNLVDECDDDQANCHSGTGDDFQLTGGTTVLATEKPTVIDSTIQ 640 \n',
'SEFPTYGFNCEFGWGSHKTFCHWEHDNHVQLKWSVLTSKTGPIQDHTGDGNFIYSQADENQKGKVARLVSPVVYSQNSAH 720 \n',
'CMTFWYHMSGSHVGTLRVKLRYQKPEEYDQLVWMAIGHQGDHWKEGRVLLHKSLKLYQVIFEGEIGKGNLGGIAVDDISI 800 \n',
'NNHISQEDCAKPADLDKKNPEIKIDETGSTPGYEGEGEGDKNISRKPGNVLKTLDPILITIIAMSALGVLLGAVCGVVLY 880 \n',
'CACWHNGMSERNLSALENYNFELVDGVKLKKDKLNTQSTYSEA\n',
'................................................................................ 80\n',
'.....................................................................N.......... 160\n',
'................................................................................ 240\n',
'....................N........................................................... 320\n',
'.................................................................N.............. 400\n',
'................................................................................ 480\n',
'................................................................................ 560\n',
'................................................................................ 640\n',
'................................................................................ 720\n',
'................................................................................ 800\n',
'................................................................................ 880\n',
'........................................... 960\n',
'\n',
'(Threshold=0.5)\n',
'----------------------------------------------------------------------\n',
'SeqName Position Potential Jury N-Glyc\n',
' agreement result\n',
'----------------------------------------------------------------------\n',
'Sequence 150 NYTT 0.5361 (5/9) + \n',
'Sequence 261 NYSV 0.5599 (6/9) + \n',
'Sequence 300 NWSA 0.4157 (6/9) - \n',
'Sequence 386 NPTD 0.7736 (9/9) +++ WARNING: PRO-X1. \n',
'Sequence 522 NGSD 0.3918 (9/9) -- \n',
'Sequence 842 NISR 0.4662 (6/9) - \n',
'Sequence 892 NLSA 0.4099 (6/9) - \n',
'----------------------------------------------------------------------\n',
'\n',
'\n',
'Graphics in PostScript\n',
'\n',
'\n',
'Go back.\n']
Output file: test-result.txt
The columns are Sequences and Report (the report index)
Sequences,Report
"Sequence 150 NYTT 0.5361 (5/9) +
",0
"Sequence 261 NYSV 0.5599 (6/9) +
",0
"Sequence 300 NWSA 0.4157 (6/9) -
",0
"Sequence 386 NPTD 0.7736 (9/9) +++ WARNING: PRO-X1.
",0
"Sequence 522 NGSD 0.3918 (9/9) --
",0
"Sequence 842 NISR 0.4662 (6/9) -
",0
"Sequence 892 NLSA 0.4099 (6/9) -
",0
"Sequence 150 NYTT 0.5361 (5/9) +
",1
"Sequence 261 NYSV 0.5599 (6/9) +
",1
"Sequence 300 NWSA 0.4157 (6/9) -
",1
"Sequence 386 NPTD 0.7736 (9/9) +++ WARNING: PRO-X1.
",1
"Sequence 522 NGSD 0.3918 (9/9) --
",1
"Sequence 842 NISR 0.4662 (6/9) -
",1
"Sequence 892 NLSA 0.4099 (6/9) -
",1
"Sequence 150 NYTT 0.5361 (5/9) +
",2
"Sequence 261 NYSV 0.5599 (6/9) +
",2
"Sequence 300 NWSA 0.4157 (6/9) -
",2
"Sequence 386 NPTD 0.7736 (9/9) +++ WARNING: PRO-X1.
",2
"Sequence 522 NGSD 0.3918 (9/9) --
",2
"Sequence 842 NISR 0.4662 (6/9) -
",2
"Sequence 892 NLSA 0.4099 (6/9) -
",2
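Each Sequences value in the results file keeps its trailing newline, which is why every record spans two lines inside quotes. Python's `csv` module reads such quoted multi-line fields back correctly. A small sketch, assuming the column names shown above (the helper name `load_grouped` and the inline sample are hypothetical), that rebuilds one list per report:

```python
import csv
import io
from collections import defaultdict


def load_grouped(csv_file):
    """Read a Sequences/Report CSV and return {report index: [rows]}."""
    grouped = defaultdict(list)
    for record in csv.DictReader(csv_file):
        # strip() removes the trailing spaces and newline kept in each field
        grouped[int(record['Report'])].append(record['Sequences'].strip())
    return dict(grouped)


# Inline sample in the same layout as the results file
sample_csv = io.StringIO(
    'Sequences,Report\n'
    '"Sequence 150 NYTT 0.5361 (5/9) + \n",0\n'
    '"Sequence 261 NYSV 0.5599 (6/9) + \n",0\n'
    '"Sequence 150 NYTT 0.5361 (5/9) + \n",1\n'
)
grouped_csv = load_grouped(sample_csv)
print(sorted(grouped_csv))
```

For a real file, open it with `newline=''` (for example `open(path, newline='', encoding='utf-8')`) so that the `csv` module handles the embedded newlines itself.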