Parsing string into list of outputs

I'm working with a text file that consists of many similar reports with the following structure:

['NetNGlyc-1.0 Server Output - DTU Health Tech\n',
 '     Asn-Xaa-Ser/Thr sequons in the sequence output below are highlighted in blue.\n',
 '          Asparagines predicted to be N-glycosylated are highlighted in red.\n',
 "Output for 'Sequence'\n",
 'Name:  Sequence  Length:  923\n',
 'MERGLPLLCAVLALVLAPAGAFRNDKCGDTIKIESPGYLTSPGYPHSYHPSEKCEWLIQAPDPYQRIMINFNPHFDLEDR      80 \n',
 'DCKYDYVEVFDGENENGHFRGKFCGKIAPPPVVSSGPFLFIKFVSDYETHGAGFSIRYEIFKRGPECSQNYTTPSGVIKS     160 \n',
 'PGFPEKYPNSLECTYIVFVPKMSEIILEFESFDLEPDSNPPGGMFCRYDRLEIWDGFPDVGPHIGRYCGQKTPGRIRSSS     240 \n',
 'GILSMVFYTDSAIAKEGFSANYSVLQSSVSEDFKCMEALGMESGEIHSDQITASSQYSTNWSAERSRLNYPENGWTPGED     320 \n',
 'SYREWIQVDLGLLRFVTAVGTQGAISKETKKKYYVKTYKIDVSSNGEDWITIKEGNKPVLFQGNTNPTDVVVAVFPKPLI     400 \n',
 'TRFVRIKPATWETGISMRFEVYGCKITDYPCSGMLGMVSGLISDSQITSSNQGDRNWMPENIRLVTSRSGWALPPAPHSY     480 \n',
 'INEWLQIDLGEEKIVRGIIIQGGKHRENKVFMRKFKIGYSNNGSDWKMIMDDSKRKAKSFEGNNNYDTPELRTFPALSTR     560 \n',
 'FIRIYPERATHGGLGLRMELLGCEVEAPTAGPTTPNGNLVDECDDDQANCHSGTGDDFQLTGGTTVLATEKPTVIDSTIQ     640 \n',
 'SEFPTYGFNCEFGWGSHKTFCHWEHDNHVQLKWSVLTSKTGPIQDHTGDGNFIYSQADENQKGKVARLVSPVVYSQNSAH     720 \n',
 'CMTFWYHMSGSHVGTLRVKLRYQKPEEYDQLVWMAIGHQGDHWKEGRVLLHKSLKLYQVIFEGEIGKGNLGGIAVDDISI     800 \n',
 'NNHISQEDCAKPADLDKKNPEIKIDETGSTPGYEGEGEGDKNISRKPGNVLKTLDPILITIIAMSALGVLLGAVCGVVLY     880 \n',
 'CACWHNGMSERNLSALENYNFELVDGVKLKKDKLNTQSTYSEA\n',
 '................................................................................      80\n',
 '.....................................................................N..........     160\n',
 '................................................................................     240\n',
 '....................N...........................................................     320\n',
 '.................................................................N..............     400\n',
 '................................................................................     480\n',
 '................................................................................     560\n',
 '................................................................................     640\n',
 '................................................................................     720\n',
 '................................................................................     800\n',
 '................................................................................     880\n',
 '...........................................                                          960\n',
 '\n',
 '(Threshold=0.5)\n',
 '----------------------------------------------------------------------\n',
 'SeqName      Position  Potential   Jury    N-Glyc\n',
 '     agreement result\n',
 '----------------------------------------------------------------------\n',
 'Sequence     150 NYTT   0.5361     (5/9)   +     \n',
 'Sequence     261 NYSV   0.5599     (6/9)   +     \n',
 'Sequence     300 NWSA   0.4157     (6/9)   -     \n',
 'Sequence     386 NPTD   0.7736     (9/9)   +++  WARNING: PRO-X1. \n',
 'Sequence     522 NGSD   0.3918     (9/9)   --    \n',
 'Sequence     842 NISR   0.4662     (6/9)   -     \n',
 'Sequence     892 NLSA   0.4099     (6/9)   -     \n',
 '----------------------------------------------------------------------\n',
 '\n',
 '\n',
 'Graphics in PostScript\n',
 '\n',
 '\n',
 'Go back.\n']

What I'm trying to get is a list in which each element is a string containing only the information I want to keep. The final list structure should look something like this:

['Sequence     150 NYTT   0.5361     (5/9)   +     \n
 Sequence     261 NYSV   0.5599     (6/9)   +     \n
 Sequence     300 NWSA   0.4157     (6/9)   -     \n',

'Sequence     150 NYTT   0.5361     (5/9)   +     \n
 Sequence     261 NYSV   0.5599     (6/9)   +     \n
 Sequence     300 NWSA   0.4157     (6/9)   -     \n
 Sequence     466 NYSV   0.6178     (7/9)   +     \n
 Sequence     300 NWSA   0.4157     (6/9)   -     \n',

'Sequence     150 NYTT   0.5361     (5/9)   +     \n
 Sequence     261 NYSV   0.5599     (6/9)   +     \n
 Sequence     300 NWSA   0.4157     (6/9)   -     \n',
...]

I managed to separate the outputs with the following code:

import re

with open('/path_to_text_file/file.txt', 'r') as file:
    test_output = file.readlines()

test_string = ''.join(map(str, test_output))  # convert the list into string

# here I decided to split the outputs by 'Go back' substring
# 1. first split by "\n\n" preceding the 'Go back' substring
# 2. then by ".\n" following the 'Go back'
# 3. then by "\n" left 

test_string_split = ' '.join(map(str, ' '.join(map(str, test_string.split('\n\n'))).split('.\n')))


# split into element by *'Go back'* substring
processed_test = ''.join(test_string_split).split('Go back')

Now I have a list in which each element holds a single output. But I haven't yet managed to strip these outputs of all the unnecessary text while preserving the structure of the list, where each element comes from a single report. I tried the following logic:

res = [] # create a list for the final result

# split each output in the text file by '\n'
for output in processed_test: 
    output_split = ''.join(output).split('\n')

    # then check each line of the output for the 'Sequence' substring
    for string in output_split:
        string_el = ''.join(string)
        if re.match("Sequence.*", string_el): 
            res.append(string_el) # save matches to the resulting list

But what I get is a flat list in which each element is a separate "Sequence" line:

['Sequence     522 NGSD   0.3918     (9/9)   --    ',
 'Sequence     842 NISR   0.4662     (6/9)   -     ',
 'Sequence     892 NLSA   0.4099     (6/9)   -     ',
 'Sequence      63 NYTV   0.7796     (9/9)   +++   ',
 'Sequence     209 NITL   0.7032     (8/9)   +     ',
 'Sequence     297 NVSI   0.6256     (8/9)   +     ',
 'Sequence     365 NLSQ   0.6403     (7/9)   +     ',
 'Sequence     522 NTSH   0.5207     (6/9)   +     ',
 'Sequence     696 NCSI   0.6619     (9/9)   ++    ',
...
...
...]

Is there a way to parse the lines inside each element so as to preserve the structure of the list? I need to be able to tell which report each sequence line came from.

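Try this, where lines is your current output (the flat list of "Sequence" lines). It splits the list into 3 parts of roughly equal size.

import numpy as np

# your current output: the flat list of "Sequence" lines (left empty here)
lines = []

output = []

# split the flat list into 3 chunks of (nearly) equal length
splitted = np.array_split(lines, 3)

for chunk in splitted:
    # join the lines of one chunk back into a single string
    output.append("\n".join(chunk))

print(output)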
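Note that np.array_split cuts a list into chunks of near-equal length, so it only lines up with report boundaries when every report has the same number of "Sequence" lines. A minimal sketch of an alternative that filters inside each report instead of flattening first is given below. It assumes processed_test from the question (one string per report, produced by splitting on 'Go back'); the helper name group_sequence_lines and the sample data are made up for illustration.

```python
def group_sequence_lines(reports):
    """Return one newline-joined string of 'Sequence' rows per report."""
    grouped = []
    for report in reports:
        # keep only the result rows that belong to this report
        rows = [line for line in report.split('\n')
                if line.startswith('Sequence')]
        # Reports without any 'Sequence' row are dropped here (this also
        # removes the empty chunk after the last 'Go back'); append ''
        # instead if list positions must match the report order.
        if rows:
            grouped.append('\n'.join(rows))
    return grouped


# Hypothetical stand-in for `processed_test` from the question
processed_test = [
    "header text\n"
    "Sequence     150 NYTT   0.5361     (5/9)   +     \n"
    "Sequence     261 NYSV   0.5599     (6/9)   +     \n",
    "header text\n"
    "Sequence      63 NYTV   0.7796     (9/9)   +++   \n",
    "",
]

grouped = group_sequence_lines(processed_test)
print(len(grouped))  # one element per report that has 'Sequence' rows
```

Because the filtering happens per report, reports with different numbers of rows stay separate without any assumption about chunk sizes.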

IIUC, you want to do the following:

  • Read in the sequence lines as different reports
  • Place the multiple reports into a DataFrame
  • Output the DataFrame as a CSV file

That can be done as follows:

Code

import ast
import os

import pandas as pd

def make_reports(file_path):

    with open(file_path, 'r') as f:
        stack = [[]]                     # start with 1st report empty

        # Convert string into Python list
        lines = ast.literal_eval(f.read())

        for line in lines:
            # Loop through all lines in list
            if line.startswith('Sequence'):
                # Append Sequence to current group
                stack[-1].append(line)
            elif line.startswith('Go back'):
                stack.append([])    # Start new report

    # Convert to a dataframe, with each report enumerated (i.e. 0, 1, 2, ...)
    dfs = []
    for i, seqs in enumerate(stack):
        if seqs:
            # Two-column dataframe: Sequence and Report number
            dfs.append(pd.DataFrame({'Sequences': seqs, 'Report': [i]*len(seqs)}))

    result = pd.concat(dfs, ignore_index=True, sort=False)

    # Write to results file (uses input file path and appends -result to name)
    result.to_csv(f'{os.path.splitext(file_path)[0]}-result.txt', 
                  encoding='utf-8', 
                  index=False)
    return result

Usage

make_reports('test.txt')

Input File: test.txt

Obtained by replicating the posted data two more times to get multiple reports
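make_reports expects the file to contain a Python list literal, since it parses the contents with ast.literal_eval. If the file instead holds the raw server output as plain text, as the readlines() call in the question suggests, the same grouping idea can be applied directly to the lines. The following is only a sketch under that assumption: split_reports is a hypothetical helper, the sample lines are shortened stand-ins, and it assumes every result row starts with the sequence name 'Sequence' as in the posted data.

```python
def split_reports(lines):
    """Group 'Sequence' rows by report; a 'Go back' line closes a report."""
    reports, current = [], []
    for line in lines:
        if line.startswith('Sequence'):
            # store the row without its trailing newline
            current.append(line.rstrip('\n'))
        elif line.startswith('Go back'):
            # end of one report: keep its rows (possibly none), start afresh
            reports.append(current)
            current = []
    if current:
        # rows of a final report that was not closed by 'Go back'
        reports.append(current)
    return reports


# Shortened, made-up stand-in for the lines of the raw text file
sample_lines = [
    "NetNGlyc-1.0 Server Output - DTU Health Tech\n",
    "Sequence     150 NYTT   0.5361     (5/9)   +     \n",
    "Go back.\n",
    "NetNGlyc-1.0 Server Output - DTU Health Tech\n",
    "Sequence      63 NYTV   0.7796     (9/9)   +++   \n",
    "Sequence     209 NITL   0.7032     (8/9)   +     \n",
    "Go back.\n",
]

reports = split_reports(sample_lines)
print([len(rows) for rows in reports])
```

A file object can be passed in place of the list, for example `with open(path) as f: reports = split_reports(f)`, because the function only iterates over lines. The result is a list of lists, so the index of each inner list tells you which report its rows came from; `'\n'.join(rows)` turns an inner list into the single string per report that the question asks for.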

['NetNGlyc-1.0 Server Output - DTU Health Tech\n',
 '     Asn-Xaa-Ser/Thr sequons in the sequence output below are highlighted in blue.\n',
 '          Asparagines predicted to be N-glycosylated are highlighted in red.\n',
 "Output for 'Sequence'\n",
 'Name:  Sequence  Length:  923\n',
 'MERGLPLLCAVLALVLAPAGAFRNDKCGDTIKIESPGYLTSPGYPHSYHPSEKCEWLIQAPDPYQRIMINFNPHFDLEDR      80 \n',
 'DCKYDYVEVFDGENENGHFRGKFCGKIAPPPVVSSGPFLFIKFVSDYETHGAGFSIRYEIFKRGPECSQNYTTPSGVIKS     160 \n',
 'PGFPEKYPNSLECTYIVFVPKMSEIILEFESFDLEPDSNPPGGMFCRYDRLEIWDGFPDVGPHIGRYCGQKTPGRIRSSS     240 \n',
 'GILSMVFYTDSAIAKEGFSANYSVLQSSVSEDFKCMEALGMESGEIHSDQITASSQYSTNWSAERSRLNYPENGWTPGED     320 \n',
 'SYREWIQVDLGLLRFVTAVGTQGAISKETKKKYYVKTYKIDVSSNGEDWITIKEGNKPVLFQGNTNPTDVVVAVFPKPLI     400 \n',
 'TRFVRIKPATWETGISMRFEVYGCKITDYPCSGMLGMVSGLISDSQITSSNQGDRNWMPENIRLVTSRSGWALPPAPHSY     480 \n',
 'INEWLQIDLGEEKIVRGIIIQGGKHRENKVFMRKFKIGYSNNGSDWKMIMDDSKRKAKSFEGNNNYDTPELRTFPALSTR     560 \n',
 'FIRIYPERATHGGLGLRMELLGCEVEAPTAGPTTPNGNLVDECDDDQANCHSGTGDDFQLTGGTTVLATEKPTVIDSTIQ     640 \n',
 'SEFPTYGFNCEFGWGSHKTFCHWEHDNHVQLKWSVLTSKTGPIQDHTGDGNFIYSQADENQKGKVARLVSPVVYSQNSAH     720 \n',
 'CMTFWYHMSGSHVGTLRVKLRYQKPEEYDQLVWMAIGHQGDHWKEGRVLLHKSLKLYQVIFEGEIGKGNLGGIAVDDISI     800 \n',
 'NNHISQEDCAKPADLDKKNPEIKIDETGSTPGYEGEGEGDKNISRKPGNVLKTLDPILITIIAMSALGVLLGAVCGVVLY     880 \n',
 'CACWHNGMSERNLSALENYNFELVDGVKLKKDKLNTQSTYSEA\n',
 '................................................................................      80\n',
 '.....................................................................N..........     160\n',
 '................................................................................     240\n',
 '....................N...........................................................     320\n',
 '.................................................................N..............     400\n',
 '................................................................................     480\n',
 '................................................................................     560\n',
 '................................................................................     640\n',
 '................................................................................     720\n',
 '................................................................................     800\n',
 '................................................................................     880\n',
 '...........................................                                          960\n',
 '\n',
 '(Threshold=0.5)\n',
 '----------------------------------------------------------------------\n',
 'SeqName      Position  Potential   Jury    N-Glyc\n',
 '     agreement result\n',
 '----------------------------------------------------------------------\n',
 'Sequence     150 NYTT   0.5361     (5/9)   +     \n',
 'Sequence     261 NYSV   0.5599     (6/9)   +     \n',
 'Sequence     300 NWSA   0.4157     (6/9)   -     \n',
 'Sequence     386 NPTD   0.7736     (9/9)   +++  WARNING: PRO-X1. \n',
 'Sequence     522 NGSD   0.3918     (9/9)   --    \n',
 'Sequence     842 NISR   0.4662     (6/9)   -     \n',
 'Sequence     892 NLSA   0.4099     (6/9)   -     \n',
 '----------------------------------------------------------------------\n',
 '\n',
 '\n',
 'Graphics in PostScript\n',
 '\n',
 '\n',
 'Go back.\n',
 'NetNGlyc-1.0 Server Output - DTU Health Tech\n',
 '     Asn-Xaa-Ser/Thr sequons in the sequence output below are highlighted in blue.\n',
 '          Asparagines predicted to be N-glycosylated are highlighted in red.\n',
 "Output for 'Sequence'\n",
 'Name:  Sequence  Length:  923\n',
 'MERGLPLLCAVLALVLAPAGAFRNDKCGDTIKIESPGYLTSPGYPHSYHPSEKCEWLIQAPDPYQRIMINFNPHFDLEDR      80 \n',
 'DCKYDYVEVFDGENENGHFRGKFCGKIAPPPVVSSGPFLFIKFVSDYETHGAGFSIRYEIFKRGPECSQNYTTPSGVIKS     160 \n',
 'PGFPEKYPNSLECTYIVFVPKMSEIILEFESFDLEPDSNPPGGMFCRYDRLEIWDGFPDVGPHIGRYCGQKTPGRIRSSS     240 \n',
 'GILSMVFYTDSAIAKEGFSANYSVLQSSVSEDFKCMEALGMESGEIHSDQITASSQYSTNWSAERSRLNYPENGWTPGED     320 \n',
 'SYREWIQVDLGLLRFVTAVGTQGAISKETKKKYYVKTYKIDVSSNGEDWITIKEGNKPVLFQGNTNPTDVVVAVFPKPLI     400 \n',
 'TRFVRIKPATWETGISMRFEVYGCKITDYPCSGMLGMVSGLISDSQITSSNQGDRNWMPENIRLVTSRSGWALPPAPHSY     480 \n',
 'INEWLQIDLGEEKIVRGIIIQGGKHRENKVFMRKFKIGYSNNGSDWKMIMDDSKRKAKSFEGNNNYDTPELRTFPALSTR     560 \n',
 'FIRIYPERATHGGLGLRMELLGCEVEAPTAGPTTPNGNLVDECDDDQANCHSGTGDDFQLTGGTTVLATEKPTVIDSTIQ     640 \n',
 'SEFPTYGFNCEFGWGSHKTFCHWEHDNHVQLKWSVLTSKTGPIQDHTGDGNFIYSQADENQKGKVARLVSPVVYSQNSAH     720 \n',
 'CMTFWYHMSGSHVGTLRVKLRYQKPEEYDQLVWMAIGHQGDHWKEGRVLLHKSLKLYQVIFEGEIGKGNLGGIAVDDISI     800 \n',
 'NNHISQEDCAKPADLDKKNPEIKIDETGSTPGYEGEGEGDKNISRKPGNVLKTLDPILITIIAMSALGVLLGAVCGVVLY     880 \n',
 'CACWHNGMSERNLSALENYNFELVDGVKLKKDKLNTQSTYSEA\n',
 '................................................................................      80\n',
 '.....................................................................N..........     160\n',
 '................................................................................     240\n',
 '....................N...........................................................     320\n',
 '.................................................................N..............     400\n',
 '................................................................................     480\n',
 '................................................................................     560\n',
 '................................................................................     640\n',
 '................................................................................     720\n',
 '................................................................................     800\n',
 '................................................................................     880\n',
 '...........................................                                          960\n',
 '\n',
 '(Threshold=0.5)\n',
 '----------------------------------------------------------------------\n',
 'SeqName      Position  Potential   Jury    N-Glyc\n',
 '     agreement result\n',
 '----------------------------------------------------------------------\n',
 'Sequence     150 NYTT   0.5361     (5/9)   +     \n',
 'Sequence     261 NYSV   0.5599     (6/9)   +     \n',
 'Sequence     300 NWSA   0.4157     (6/9)   -     \n',
 'Sequence     386 NPTD   0.7736     (9/9)   +++  WARNING: PRO-X1. \n',
 'Sequence     522 NGSD   0.3918     (9/9)   --    \n',
 'Sequence     842 NISR   0.4662     (6/9)   -     \n',
 'Sequence     892 NLSA   0.4099     (6/9)   -     \n',
 '----------------------------------------------------------------------\n',
 '\n',
 '\n',
 'Graphics in PostScript\n',
 '\n',
 '\n',
 'Go back.\n',
 'NetNGlyc-1.0 Server Output - DTU Health Tech\n',
 '     Asn-Xaa-Ser/Thr sequons in the sequence output below are highlighted in blue.\n',
 '          Asparagines predicted to be N-glycosylated are highlighted in red.\n',
 "Output for 'Sequence'\n",
 'Name:  Sequence  Length:  923\n',
 'MERGLPLLCAVLALVLAPAGAFRNDKCGDTIKIESPGYLTSPGYPHSYHPSEKCEWLIQAPDPYQRIMINFNPHFDLEDR      80 \n',
 'DCKYDYVEVFDGENENGHFRGKFCGKIAPPPVVSSGPFLFIKFVSDYETHGAGFSIRYEIFKRGPECSQNYTTPSGVIKS     160 \n',
 'PGFPEKYPNSLECTYIVFVPKMSEIILEFESFDLEPDSNPPGGMFCRYDRLEIWDGFPDVGPHIGRYCGQKTPGRIRSSS     240 \n',
 'GILSMVFYTDSAIAKEGFSANYSVLQSSVSEDFKCMEALGMESGEIHSDQITASSQYSTNWSAERSRLNYPENGWTPGED     320 \n',
 'SYREWIQVDLGLLRFVTAVGTQGAISKETKKKYYVKTYKIDVSSNGEDWITIKEGNKPVLFQGNTNPTDVVVAVFPKPLI     400 \n',
 'TRFVRIKPATWETGISMRFEVYGCKITDYPCSGMLGMVSGLISDSQITSSNQGDRNWMPENIRLVTSRSGWALPPAPHSY     480 \n',
 'INEWLQIDLGEEKIVRGIIIQGGKHRENKVFMRKFKIGYSNNGSDWKMIMDDSKRKAKSFEGNNNYDTPELRTFPALSTR     560 \n',
 'FIRIYPERATHGGLGLRMELLGCEVEAPTAGPTTPNGNLVDECDDDQANCHSGTGDDFQLTGGTTVLATEKPTVIDSTIQ     640 \n',
 'SEFPTYGFNCEFGWGSHKTFCHWEHDNHVQLKWSVLTSKTGPIQDHTGDGNFIYSQADENQKGKVARLVSPVVYSQNSAH     720 \n',
 'CMTFWYHMSGSHVGTLRVKLRYQKPEEYDQLVWMAIGHQGDHWKEGRVLLHKSLKLYQVIFEGEIGKGNLGGIAVDDISI     800 \n',
 'NNHISQEDCAKPADLDKKNPEIKIDETGSTPGYEGEGEGDKNISRKPGNVLKTLDPILITIIAMSALGVLLGAVCGVVLY     880 \n',
 'CACWHNGMSERNLSALENYNFELVDGVKLKKDKLNTQSTYSEA\n',
 '................................................................................      80\n',
 '.....................................................................N..........     160\n',
 '................................................................................     240\n',
 '....................N...........................................................     320\n',
 '.................................................................N..............     400\n',
 '................................................................................     480\n',
 '................................................................................     560\n',
 '................................................................................     640\n',
 '................................................................................     720\n',
 '................................................................................     800\n',
 '................................................................................     880\n',
 '...........................................                                          960\n',
 '\n',
 '(Threshold=0.5)\n',
 '----------------------------------------------------------------------\n',
 'SeqName      Position  Potential   Jury    N-Glyc\n',
 '     agreement result\n',
 '----------------------------------------------------------------------\n',
 'Sequence     150 NYTT   0.5361     (5/9)   +     \n',
 'Sequence     261 NYSV   0.5599     (6/9)   +     \n',
 'Sequence     300 NWSA   0.4157     (6/9)   -     \n',
 'Sequence     386 NPTD   0.7736     (9/9)   +++  WARNING: PRO-X1. \n',
 'Sequence     522 NGSD   0.3918     (9/9)   --    \n',
 'Sequence     842 NISR   0.4662     (6/9)   -     \n',
 'Sequence     892 NLSA   0.4099     (6/9)   -     \n',
 '----------------------------------------------------------------------\n',
 '\n',
 '\n',
 'Graphics in PostScript\n',
 '\n',
 '\n',
 'Go back.\n']

Output File: test-result.txt

Columns are Sequences and Report (the report index)

Sequences,Report
"Sequence     150 NYTT   0.5361     (5/9)   +     
",0
"Sequence     261 NYSV   0.5599     (6/9)   +     
",0
"Sequence     300 NWSA   0.4157     (6/9)   -     
",0
"Sequence     386 NPTD   0.7736     (9/9)   +++  WARNING: PRO-X1. 
",0
"Sequence     522 NGSD   0.3918     (9/9)   --    
",0
"Sequence     842 NISR   0.4662     (6/9)   -     
",0
"Sequence     892 NLSA   0.4099     (6/9)   -     
",0
"Sequence     150 NYTT   0.5361     (5/9)   +     
",1
"Sequence     261 NYSV   0.5599     (6/9)   +     
",1
"Sequence     300 NWSA   0.4157     (6/9)   -     
",1
"Sequence     386 NPTD   0.7736     (9/9)   +++  WARNING: PRO-X1. 
",1
"Sequence     522 NGSD   0.3918     (9/9)   --    
",1
"Sequence     842 NISR   0.4662     (6/9)   -     
",1
"Sequence     892 NLSA   0.4099     (6/9)   -     
",1
"Sequence     150 NYTT   0.5361     (5/9)   +     
",2
"Sequence     261 NYSV   0.5599     (6/9)   +     
",2
"Sequence     300 NWSA   0.4157     (6/9)   -     
",2
"Sequence     386 NPTD   0.7736     (9/9)   +++  WARNING: PRO-X1. 
",2
"Sequence     522 NGSD   0.3918     (9/9)   --    
",2
"Sequence     842 NISR   0.4662     (6/9)   -     
",2
"Sequence     892 NLSA   0.4099     (6/9)   -     
",2
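Each field in the output above spans two lines because the stored strings still end with '\n'. If separate columns are more useful than the raw line, each row can be split on whitespace. The sketch below assumes the column layout shown in the report header (SeqName, Position with the sequon, Potential, Jury agreement, N-Glyc result) and that the sequence name contains no spaces; parse_row and the dictionary keys are illustrative names, not part of the code above.

```python
def parse_row(line):
    """Split one 'Sequence ...' result row into named fields."""
    parts = line.split()  # splits on any whitespace, drops the trailing '\n'
    name, position, sequon, potential, jury, result = parts[:6]
    return {
        'name': name,
        'position': int(position),
        'sequon': sequon,
        'potential': float(potential),
        'jury': jury.strip('()'),      # '(9/9)' becomes '9/9'
        'result': result,
        'note': ' '.join(parts[6:]),   # e.g. a trailing WARNING, else ''
    }


row = parse_row(
    'Sequence     386 NPTD   0.7736     (9/9)   +++  WARNING: PRO-X1. \n'
)
print(row['position'], row['sequon'], row['result'], row['note'])
```

A list of such dictionaries, each extended with its report index, can be passed to pd.DataFrame to get one column per field.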
