Splitting a string into a list of integers with Python
This method takes a file name and the directory that contains the file. The file holds a matrix of data, and the method needs to copy the first 20 columns of each row that come after the row number and the letter for that row. The first 3 lines of each file are skipped because they contain information that is not needed, and the data at the bottom of the file is not needed either.
For example, a file would look like:
unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)
-blank line
unimportant information--------
unimportant information--------
The output of the method needs to print out a "matrix" in some given form.
So far the output gives a list with each row as a string, but I'm trying to figure out the best way to approach the problem. I don't know how to ignore the unimportant information at the end of the files. I don't know how to retrieve only the first 20 columns after the letter in each row, and I don't know how to ignore the row number and the row letter.
import os

def pssmMatrix(self, ipFileName, directory):
    dir = directory
    filename = ipFileName
    my_lst = []
    # go through every file in the fasta folder
    for f in os.listdir(dir):
        # split the file name into its base name and extension
        file, file_ext = os.path.splitext(f)
        if file == ipFileName:
            with open(os.path.join(dir, f)) as file_object:
                # skip the first 3 lines of the file
                for _ in range(3):
                    next(file_object)
                for line in file_object:
                    my_lst.append(' '.join(line.strip().split()))
    return my_lst
Expected results:
['-1 2 -3 4 5 6 7'], ['3 -1 3 4 0 -2 1'], ['3 -1 3 6 0 -2 5']
Actual results:
['1 F -1 2 -3 4 5 6 7'], ['2 L 3 -1 3 4 0 -2 1'], ['3 A 3 -1 3 6 0 -2 5'], [' '], [' unimportant info'], ['unimportant info']
Try this solution.
import re

# Match the run of digits, minus signs, and whitespace that follows
# "<digit> <uppercase letter> " in a data row.
reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')

text = """
unimportant information--------
unimportant information--------
-blank line
1 F -1 2 -3 4 5 6 7 (more columns of ints)
2 L 3 -1 3 4 0 -2 1 (more columns of ints)
3 A 3 -1 3 6 0 -2 5 (more columns of ints)"""

# Indices 0-3 are the leading empty line, the two header lines, and the blank line.
ignore_start = 4
expected_array = []
for index, line in enumerate(text.splitlines()):
    if index >= ignore_start:
        match = reg.search(line)
        if match:
            # The match is already a space-separated string, so append it directly.
            expected_array.append(match.group(0).strip())
print(expected_array)
# Result: [
#  '-1 2 -3 4 5 6 7',
#  '3 -1 3 4 0 -2 1',
#  '3 -1 3 6 0 -2 5'
# ]
To drop the first two columns, you can change:
    my_lst.append(' '.join(line.strip().split()))
to
    my_lst.append(' '.join(line.strip().split()[2:]))
That drops the first two columns after the line has been split and before the remaining columns are joined back together.
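The question also asks for only the first 20 columns after the letter, and the title asks for integers, so the slice can be bounded at both ends and each field converted. This is a minimal sketch, assuming rows are whitespace-separated as in the sample; `parse_row` and `n_cols` are hypothetical names and not part of the original code:

```python
def parse_row(line, n_cols=20):
    """Return up to n_cols integers that follow the row number and row letter."""
    fields = line.split()
    # fields[0] is the row number and fields[1] is the row letter;
    # the slice keeps at most n_cols of the columns that come after them.
    return [int(value) for value in fields[2:2 + n_cols]]


row = parse_row("1 F -1 2 -3 4 5 6 7")
print(row)  # [-1, 2, -3, 4, 5, 6, 7]
```

Note that `int()` raises `ValueError` on a non-numeric field, so this only suits lines already known to be data rows.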
To drop the last 3 irrelevant lines, maybe the simplest solution is just to change:
    return my_lst
to
    return my_lst[:-3]
That returns everything except the last 3 lines.
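Hard-coding the number of trailing lines only works if every file ends the same way. An alternative sketch is to stop reading at the first blank line after the header. This assumes the data block is always terminated by an empty line, as in the sample, which is an assumption about the file format; `read_data_rows` and `header_lines` are hypothetical names:

```python
import io


def read_data_rows(file_object, header_lines=3):
    """Collect data rows after the header, stopping at the first blank line."""
    rows = []
    for index, line in enumerate(file_object):
        if index < header_lines:
            continue  # skip the unimportant header lines
        if not line.strip():
            break  # a blank line marks the end of the matrix
        # drop the row number and row letter, keep the remaining columns
        rows.append(' '.join(line.split()[2:]))
    return rows


sample = io.StringIO(
    "unimportant information\n"
    "unimportant information\n"
    "\n"
    "1 F -1 2 -3 4 5 6 7\n"
    "2 L 3 -1 3 4 0 -2 1\n"
    "\n"
    "unimportant information\n"
)
print(read_data_rows(sample))  # ['-1 2 -3 4 5 6 7', '3 -1 3 4 0 -2 1']
```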
OK, so it looks like you have a file in which the lines you want always start with a number followed by a letter. We can apply a regular expression that matches only the lines with that pattern and captures only the numbers that come after it.
The expression for this would look like:
(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+
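import re

reg = re.compile(r'(?<=[0-9]\s[A-Z]\s)[0-9\-\s]+')
# `file` is the open file object and `my_lst` is the result list from the question
for line in file:
    match = reg.search(line)
    if match:
        # keep the matched numbers as one space-separated string
        my_lst.append(match.group(0).strip())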
Hope that helps.
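To end up with integers rather than strings, the text after the row number and letter can be tokenized and converted. The sketch below makes the same assumption that wanted rows start with a row number and a single uppercase letter; it uses a capture group instead of the lookbehind, and `ROW_PATTERN`, `INT_PATTERN`, and `extract_ints` are hypothetical names:

```python
import re

# Row number, one uppercase letter, then the rest of the line (captured).
ROW_PATTERN = re.compile(r'^\s*\d+\s+[A-Z]\s+(.*)$')
# An optionally negative integer.
INT_PATTERN = re.compile(r'-?\d+')


def extract_ints(line, n_cols=20):
    """Return up to n_cols integers from a data row, or None for other lines."""
    match = ROW_PATTERN.match(line)
    if match is None:
        return None
    tokens = INT_PATTERN.findall(match.group(1))
    return [int(token) for token in tokens[:n_cols]]


print(extract_ints("2 L 3 -1 3 4 0 -2 1 (more columns of ints)"))
# [3, -1, 3, 4, 0, -2, 1]
print(extract_ints("unimportant information--------"))
# None
```

Because non-matching lines return `None`, the header and footer lines can be filtered out without counting them.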