Python subsetting sequence using list of numbers

I am trying to write a program that takes a file containing a list of numbers and uses each of these numbers to extract part of a string. When I call my function (below), I get the error:

TypeError: unsupported operand type(s) for -: 'str' and 'int'

I tried changing the i in the for loop to int(i) in case, for some reason, i wasn't an integer, but that resulted in the following error:

ValueError: invalid literal for int() with base 10: ''

Code:

#Function Collects Sequences and Writes to a Files
def gen_insertion_seq(index, seq, gene):
    output = open("%s_insertion_seq.txt" % gene, 'w')
    indices = index.read()
    sequence = seq.read()
    for i in indices:
        site = sequence[i-9:i+15]
        output.write(site + '\n')

#Open Index Files
shaker_index = open("212_index.txt")
kir2_index = open("214_index.txt")
asic1a_index = open("216_index.txt")
nachra7_index = open("252_index.txt")

#Open Sequence Files
shaker_seq = open("212_seq.txt")
kir2_seq = open("214_seq.txt")
asic1a_seq = open("216_seq.txt")
nachra7_seq = open("252_seq.txt")
#Call function on Index and Sequence Files - Should output list of generated Sequences for insertion sites.
#Must hand check first couple
gen_insertion_seq(shaker_index, shaker_seq, 'shaker')

Sample input files:

212_index.txt

1312
210
633
696
1475
637
1198
645
1504
361
651
...

212_seq.txt

ATGGCCGCCGTGGCACTGCGAGAACAACAGCTCCAACGAAATAGTCTGGATGGATACGGTTCACTGCCTAAACTGTCTAGCCAAGACGAAGAAGGTGGCGCCGGCCATGGCTTCGGTGGGGGC

The errors in your code are caused by the fact that read does not do quite what you seem to expect. Called without parameters, it reads the entire file into a single string. You then iterate over the characters in that string instead of the numbers in the file. The TypeError happens when you evaluate '1' - 9 while indexing into the sequence.
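A minimal sketch of that behavior, using an in-memory io.StringIO object as a stand-in for the index file (the two-line contents are made up for illustration):

```python
import io

# Stand-in for the index file; the contents are illustrative only.
index_file = io.StringIO("1312\n210\n")
indices = index_file.read()          # one string: '1312\n210\n'

chars = list(indices)                # iterating a str yields single characters
print(chars[:5])                     # ['1', '3', '1', '2', '\n']

try:
    chars[0] - 9                     # same operation as i - 9 in the question
except TypeError as exc:
    error_message = str(exc)
    print(error_message)             # unsupported operand type(s) for -: 'str' and 'int'
```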

Your intuition to convert the iterated values to int is basically correct. However, since you are still iterating over characters, you get int('1'), int('3'), int('1'), int('2'), followed by a ValueError from int('\n'). read reads in the entire file as-is, newlines and all.
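The same failure can be reproduced on a plain string. This is only a sketch; the string below stands in for what read would return for a two-line index file:

```python
text = "1312\n210\n"   # illustrative stand-in for the result of read()

converted = []
failed_on = None
for ch in text:
    try:
        converted.append(int(ch))   # works for the digit characters
    except ValueError:
        failed_on = ch              # the first character int() cannot parse
        break

print(converted)        # [1, 3, 1, 2]
print(repr(failed_on))  # '\n'
```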

Fortunately, the file object is iterable over the lines in the file. This means that you can do something like for line in file: ..., and line will take on the string value of each index you want to parse. Each line still ends with its newline character, but int ignores leading and trailing whitespace, so you can pass the line directly into int with no further modification.
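A short sketch of line-by-line iteration, again with io.StringIO standing in for a real file and made-up contents. Note that each line keeps its trailing newline and that int accepts surrounding whitespace; a completely blank line would still raise ValueError, so you may want to skip those:

```python
import io

# Illustrative stand-in for 212_index.txt.
index_file = io.StringIO("1312\n210\n633\n")

lines = list(index_file)                  # iterating a file yields one line at a time
print(lines)                              # ['1312\n', '210\n', '633\n']

indices = [int(line) for line in lines]   # int() tolerates the trailing '\n'
print(indices)                            # [1312, 210, 633]
```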

There are a number of additional improvements you can make to your code, including the corrections that would make it work properly.

  1. As per @Acccumulation's advice, open files in a with block to ensure that they get cleaned up properly if the program crashes, e.g. from an I/O error. The with block also closes the file automatically when it ends, which is something you are currently not doing at all (but should be).

  2. Conceptually, there is no need for you to pass around file objects at all. You only use each one in one place for one purpose. I would even extend this to recommend that you write a small function to parse each file type into a usable format and pass that around instead.

  3. Files are iterable by line in Python. This is especially handy for your index files, which are a very line-oriented format. You do not need to do a full read at all, and can save a couple of steps compared with @MaximTitarenko's comment.

  4. You can pass the lines of a file to str.join to combine a sequence that spans several lines. Strip each line first so that the line breaks do not end up in the result.
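Points 1 and 4 can be sketched together as follows. The temporary file and its two-line contents are assumptions made so that the example is self-contained; they are not part of the original data:

```python
import os
import tempfile

# Write a throwaway two-line sequence file (contents are illustrative).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("ATGGCC\nGCCGTG\n")
    path = tmp.name

with open(path) as file:
    raw = "".join(file)                                 # newlines are kept
    file.seek(0)                                        # rewind to read again
    sequence = "".join(line.strip() for line in file)   # newlines removed

print(repr(raw))      # 'ATGGCC\nGCCGTG\n'
print(sequence)       # ATGGCCGCCGTG
print(file.closed)    # True: the with block closed the file

os.remove(path)
```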

Combining all that advice, you could do the following:

def read_indices(fname):
    with open(fname, 'r') as file:
        return [int(index) for index in file]

def read_sequence(fname):
    with open(fname, 'r') as file:
        # Strip each line so that line breaks do not end up inside the sequence
        return ''.join(line.strip() for line in file)

Since files are iterables of strings, you can use them in list comprehensions and string join operations like that. The rest of your code will now look much cleaner:

def gen_insertion_seq(index, seq, gene):
    indices = read_indices(index)
    sequence = read_sequence(seq)
    with open("%s_insertion_seq.txt" % gene, 'w') as output:
        for i in indices:
            site = sequence[i-9:i+15]
            output.write(site + '\n')

gen_insertion_seq('212_index.txt', '212_seq.txt', 'shaker')
gen_insertion_seq('214_index.txt', '214_seq.txt', 'kir2')
gen_insertion_seq('216_index.txt', '216_seq.txt', 'asic1a')
gen_insertion_seq('252_index.txt', '252_seq.txt', 'nachra7')

Your main function is now easier to understand because it focuses only on the sequences and not on things like I/O and parsing. You also don't have a bunch of open file handles floating around, waiting for an error. In fact, the file operations are all self-contained, away from the real task.
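When you hand check the first few results, it may help to know how the slice sequence[i-9:i+15] behaves near the ends of the string. The sketch below uses a short made-up sequence, and window is a hypothetical helper that is not part of the original code; whether clamping the start is the right behavior for your data is an assumption you would need to confirm:

```python
sequence = "ATGGCCGCCGTGGCACTGCGAGAACAACAGCTCCAACG"  # 38 bases, illustrative

def window(seq, i, before=9, after=15):
    """Hypothetical helper: slice seq[i-before:i+after], clamping the start at 0."""
    return seq[max(i - before, 0):i + after]

interior = sequence[20 - 9:20 + 15]         # start and stop both inside the string
near_start = sequence[5 - 9:5 + 15]         # start is -4, which counts from the end
past_end = sequence[1000 - 9:1000 + 15]     # slicing past the end raises no error

print(len(interior))       # 24
print(repr(near_start))    # ''
print(repr(past_end))      # ''
print(window(sequence, 5)) # ATGGCCGCCGTGGCACTGCG
```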

If you had sequences (in the Python sense) of file IDs and gene names, you could further simplify the call to your function with a loop:

# file_id is used instead of id to avoid shadowing the built-in id()
for file_id, gene in zip((212, 214, 216, 252), ('shaker', 'kir2', 'asic1a', 'nachra7')):
    gen_insertion_seq('%d_index.txt' % file_id, '%d_seq.txt' % file_id, gene)
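To see which arguments such a loop would produce without touching any files, you can build the argument tuples first. This is only a sketch; the names file_ids, genes, and calls are introduced here for illustration:

```python
file_ids = (212, 214, 216, 252)
genes = ('shaker', 'kir2', 'asic1a', 'nachra7')

# Pair each file ID with its gene name and build the three arguments per call.
calls = [('%d_index.txt' % file_id, '%d_seq.txt' % file_id, gene)
         for file_id, gene in zip(file_ids, genes)]

for call in calls:
    print(call)
# First line printed: ('212_index.txt', '212_seq.txt', 'shaker')
```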

PS. The I/O section in the Python tutorial is really nice. The section on files may be of particular interest to you.

Try inputting 'shaker' with double quotes, "shaker". Or, use str(gene) in your function.

OK, I just realized it's Python, so the quotes thing shouldn't matter, I think.

Or open("{}_insertion_seq.txt".format(gene), 'w')

If it's at the write, change output.write(site + '\n') to output.write(str(site) + '\n')
