
Splitting a string into a list of integers with Python

This method takes a file name and the directory containing the file. The file contains a matrix of data, and for each row the method needs to copy the first 20 columns that follow the row number and the row's letter. The first 3 lines of each file are skipped because they hold information that is not needed, and the data at the bottom of the file is not needed either.

For example, a file would look like:

unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)
-blank line
unimportant information--------
unimportant information--------

The method needs to print out a "matrix" in some given form.

So far the output is a list with each row as a string, but I'm trying to figure out the best way to approach the problem. I don't know how to ignore the unimportant information at the end of the files, how to retrieve only the first 20 columns after the letter in each row, or how to ignore the row number and the row letter.

def pssmMatrix(self,ipFileName,directory):
    dir = directory
    filename = ipFileName
    my_lst = []

    #takes every file in fasta folder and put in files list
    for f in os.listdir(dir):
        #splits the file name into file name and its extension
        file, file_ext = os.path.splitext(f)

        if file == ipFileName:
            with open(os.path.join(dir,f)) as file_object:

                for _ in range(3):
                    next(file_object)
                for line in file_object:
                    my_lst.append(' '.join(line.strip().split()))
    return my_lst

Expected results:

['-1 2 -3 4 5 6 7'], ['3 -1 3 4 0 -2 1'], ['3 -1 3 6 0 -2 5']

Actual results:

['1 F -1 2 -3 4 5 6 7'], ['2 L 3 -1 3 4 0 -2 1'], ['3 A 3 -1 3 6 0 -2 5'],  [' '], [' unimportant info'], ['unimportant info']  

Try this solution.

    import re
    reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')

    text = """
    unimportant information--------

    unimportant information--------
    -blank line

    1 F -1 2 -3 4 5 6 7 (more columns of ints)

    2 L 3 -1 3 4 0 -2 1 (more columns of ints)

    3 A 3 -1 3 6 0 -2 5 (more columns of ints)"""

    ignore_start = 5  # skip the header: line indices 0-4
    expected_array = []
    for index, line in enumerate(text.splitlines()):
        if index >= ignore_start:
            match = reg.search(line)
            if match:
                # Keep only the run of numbers after "<row number> <letter> "
                result = match.group(0).strip()
                # Append the string as-is; ' '.join(result) would put a
                # space between every character
                expected_array.append(result)

    print(expected_array)
    # Result: [
    # '-1 2 -3 4 5 6 7',
    # '3 -1 3 4 0 -2 1',
    # '3 -1 3 6 0 -2 5'
    # ]

To drop the first two columns, you can change:

my_lst.append(' '.join(line.strip().split()))

to

my_lst.append(' '.join(line.strip().split()[2:]))

That drops the first two columns after the line has been split and before the remaining fields are joined back together.
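The same slice can also cap the row at 20 columns and convert the fields to integers, which is what the question's title asks for. This is only a sketch: `parse_row` and `n_cols` are hypothetical names, and it assumes every field after the letter is a plain integer (the "(more columns of ints)" placeholder from the sample would not parse).

```python
def parse_row(line, n_cols=20):
    """Return up to n_cols integers that follow the row number and letter."""
    fields = line.split()
    # fields[0] is the row number and fields[1] the letter;
    # keep at most the next n_cols fields and convert each to int
    return [int(tok) for tok in fields[2:2 + n_cols]]


print(parse_row("1 F -1 2 -3 4 5 6 7"))
# [-1, 2, -3, 4, 5, 6, 7]
```

If the list of strings from the question is still wanted, `' '.join(fields[2:2 + n_cols])` gives the same columns without the conversion.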

To drop the last 3 irrelevant lines, perhaps the simplest solution is to change:

return my_lst

to

return my_lst[:-3]

That will return everything except the last 3 lines.
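A fixed `[:-3]` only works if the footer is always exactly 3 lines long. If the data block is always followed by a blank line, as in the sample, one alternative is to stop reading at the first blank line. The sketch below assumes that layout; `read_data_rows` and `header_lines` are hypothetical names, and `io.StringIO` merely stands in for the opened file.

```python
import io


def read_data_rows(file_object, header_lines=3):
    """Skip the header, then collect rows until the first blank line."""
    rows = []
    for _ in range(header_lines):
        next(file_object, None)  # tolerate files shorter than the header
    for line in file_object:
        fields = line.split()
        if not fields:
            break  # a blank line marks the end of the data block
        rows.append(' '.join(fields[2:]))  # drop row number and letter
    return rows


sample = io.StringIO(
    "unimportant information--------\n"
    "unimportant information--------\n"
    "\n"
    "1 F -1 2 -3 4 5 6 7\n"
    "2 L 3 -1 3 4 0 -2 1\n"
    "3 A 3 -1 3 6 0 -2 5\n"
    "\n"
    "unimportant information--------\n"
    "unimportant information--------\n"
)
print(read_data_rows(sample))
# ['-1 2 -3 4 5 6 7', '3 -1 3 4 0 -2 1', '3 -1 3 6 0 -2 5']
```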

OK, so it looks like you have a file in which the lines you want always start with a number followed by a letter. We can apply a regular expression that matches only lines fitting that pattern and captures only the numbers that come after it.

The expression for this would look like (?<=[0-9]\s[A-Z]\s)[0-9\-\s]+

import re

reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')

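for line in file_object:
    match = reg.search(line)
    if match:
        # Keep only the numbers that follow "<row number> <letter> "
        result = match.group(0).strip()
        # Append the string as-is; ' '.join(result) would put a
        # space between every character
        my_lst.append(result)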
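Because the pattern only matches data rows, it can replace both the header skip and the footer trimming. The sketch below is a variant of that idea, not the exact expression above: it anchors on "number, letter" at the start of the line and then pulls out the integers with a second pattern. `extract_matrix`, `ROW`, `INT` and `n_cols` are hypothetical names, and it assumes no header or footer line starts with a number followed by a single capital letter.

```python
import re

# A data row: optional indent, row number, one capital letter, then the values
ROW = re.compile(r'^\s*\d+\s+[A-Z]\s+(.*)$')
# A signed integer
INT = re.compile(r'-?\d+')


def extract_matrix(lines, n_cols=20):
    """Return a list of integer rows, keeping at most n_cols values per row."""
    matrix = []
    for line in lines:
        match = ROW.match(line)
        if match:
            values = INT.findall(match.group(1))[:n_cols]
            matrix.append([int(tok) for tok in values])
    return matrix


lines = [
    "unimportant information--------",
    "",
    "1 F -1 2 -3 4 5 6 7",
    "2 L 3 -1 3 4 0 -2 1",
    "",
    "unimportant information--------",
]
print(extract_matrix(lines))
# [[-1, 2, -3, 4, 5, 6, 7], [3, -1, 3, 4, 0, -2, 1]]
```

An open file object can be passed as `lines` directly, since iterating over it yields one line at a time.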

Hope that helps.

