How to select the last value/index based on the last occurrence of a certain regular expression in a list in Python?

I'm performing certain calculations on a large .txt file (tab-delimited, 300+ columns, 1,000,000+ rows) using the following code:

samples = []
OTUnumbers = []

with open('all.16S.uniq.txt','r') as file:
    for i,line in enumerate(file):
        columns = line.strip().split('\t')[11:353]
        if i == 0: #headers are sample names so first row
            samples = columns #save sample names
            OTUnumbers = [0 for s in samples] #set starting value as zero
        else:
            for n,v in enumerate(columns):
                if int(v) > 0: #values are read as strings, so convert before comparing
                    OTUnumbers[n] = OTUnumbers[n] + 1

result = dict(zip(samples,OTUnumbers))

I have a question about one part of this code. Code of interest:

columns = line.strip().split('\t')[11:353] # row i is split and saved as a list

The .txt file has a lot of columns and I'm only interested in some of them. I frequently generate this kind of .txt file, and the columns of interest always start at index 11 but do not always end at index 353. The last columns are never columns of interest. I want to "automate" this code so that Python runs it on the columns of interest only.

The names of all columns of interest start with "sample". So basically I want to select everything up to the last column that matches the regular expression "sample". Note that I read a line of the file, split it, and then save it as a list (= columns). The code I'm looking for:

columns = line.strip().split('\t')[11:```LAST COLUMN WHICH STARTS WITH "sample"```]

Based on some research on the web I tried the following code, but it raises a SyntaxError.

columns = line.strip().split('\t') 11:columns.where(columns==^[sample]).last_valid_index()]

Any ideas how to write this code?

UPDATE:

OTUnumber = []

import re

with open('all.16S.uniq.txt','r') as f_in:
    data = f_in.read()
    for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
        x=g.split('\t') #list containing all sample names

with open('all.16S.uniq.txt','r') as file:
     for i,line in enumerate(file): 
        columns = line.strip().split('\t')[x]
        if i == 0:
            samples = columns2 
            OTUnumber = [0 for s in samples] #
        else:
            for n,v in enumerate(columns):
                if int(v) > 0:
                    OTUnumber[n] = OTUnumber[n] + 1
                else:
                    continue

result = dict(zip(samples,OTUnumber))

This returns the error: TypeError: list indices must be integers or slices, not list

You could achieve this with a simple regex (with flags set to re.MULTILINE):

import re

data = '''
header 1\theader 2\theader 3\theader 4\theader 5\theader 6\theader 7\theader 8\theader 10\theader 11\theader 12\theader 13\theader 14
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample11\tsample12\tsample13\tc3\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample21\tsample22\tsample23\tsample24\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample31\tsample32\tc3
'''

for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print(g.split('\t'))

Prints:

['sample11', 'sample12', 'sample13']
['sample21', 'sample22', 'sample23', 'sample24']
['sample31', 'sample32']
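To make it easier to see what the pattern does, here is the same regex written out with re.VERBOSE and a comment on each part. This is only an annotated sketch: the name SAMPLE_BLOCK and the test line are made up for illustration and are not part of the original code.

```python
import re

# Same pattern as above, spelled out with re.VERBOSE so each part can be commented.
# SAMPLE_BLOCK is an illustrative name, not something from the original post.
SAMPLE_BLOCK = re.compile(r'''
    ^                 # start of a line (because of re.MULTILINE)
    (?:[^\t]+\t){11}  # skip the first 11 tab-terminated columns
    (                 # capture group 1 begins
      .*              #   greedy: run as far right as possible ...
      sample[^\t]+    #   ... while still ending on a column containing "sample"
    )
    .*$               # discard whatever columns follow
''', flags=re.MULTILINE | re.VERBOSE)

# Made-up line: 11 leading columns, two sample columns, two trailing columns.
line = '\t'.join(['c%d' % k for k in range(1, 12)] + ['sample1', 'sample2', 'x', 'y'])
match = SAMPLE_BLOCK.search(line)
sample_columns = match.group(1).split('\t')
print(sample_columns)   # ['sample1', 'sample2']
```

The greedy `.*` inside the group is what makes the match stop at the last "sample" column rather than the first one.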

Edit (to read from file):

import re

with open('all.16S.uniq.txt','r') as f_in:
    data = f_in.read()
    for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
        print(g.split('\t'))

Edit 2: to get the index just past the last column that contains sample (so it can be used directly as the end of the slice, as in columns[11:index]):

import re

data = '''
header 1\theader 2\theader 3\theader 4\theader 5\theader 6\theader 7\theader 8\theader 10\theader 11\theader 12\theader 13\theader 14
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample11\tsample12\tsample13\tc3\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample21\tsample22\tsample23\tsample24\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample31\tsample32\tc3
'''

for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print('Index of last column is:', 11 + len(g.split('\t')))

Prints:

Index of last column is: 14
Index of last column is: 15
Index of last column is: 13
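Since only the header row carries the sample names and the file is large, it may be enough to run the pattern on the first line and reuse the resulting end index for every other row, instead of reading the whole file into memory. A minimal sketch under that assumption; the helper name slice_end_from_header, its default arguments, and the example header are hypothetical.

```python
import re

# Hypothetical helper: the name and the default arguments are illustrative only.
def slice_end_from_header(header_line, first=11, prefix='sample'):
    """Return the exclusive end index of the sample columns, or None if none match."""
    pattern = r'^(?:[^\t]+\t){%d}(.*%s[^\t]+)' % (first, re.escape(prefix))
    match = re.match(pattern, header_line)
    if match is None:
        return None
    # group(1) spans from column `first` up to the last column containing the prefix
    return first + len(match.group(1).split('\t'))

# Made-up header: 11 leading columns, three sample columns, one trailing column.
header = '\t'.join(['h%d' % k for k in range(11)]
                   + ['sample_a', 'sample_b', 'sample_c', 'taxonomy'])
end = slice_end_from_header(header)
print(end)                          # 14
print(header.split('\t')[11:end])   # ['sample_a', 'sample_b', 'sample_c']
```

The returned value is an exclusive bound, so the last sample column itself sits at index end - 1.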

This is one approach using a custom function.

Example:

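def get_last_sample_index(columns):
    # Position of the last `sample` column, counted from the end (1 = last column)
    for ind, c in enumerate(reversed(columns), 1):  # walk the columns in reverse
        if c.startswith("sample"):                  # first hit is the last `sample` column
            return ind
    return -1                                       # no `sample` column found

with open('all.16S.uniq.txt','r') as file:
    for i,line in enumerate(file):
        columns = line.strip().split('\t')
        if i == 0:  # only the header row holds the sample names
            # convert the from-the-end position into an exclusive slice end
            end = len(columns) - get_last_sample_index(columns) + 1
        columns = columns[11:end]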
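For the counting task from the question, the same idea can be sketched end to end: find the last "sample" header once, then stream the remaining rows. The function name count_otus, its parameters, and the tiny demo file are assumptions made for illustration; the sketch also assumes every sample cell is numeric and uses float() rather than int() so that decimal values would not raise.

```python
import os
import tempfile

# Hypothetical helper; the name, parameters and defaults are illustrative only.
def count_otus(path, first=11, prefix='sample'):
    """Count, per sample column, the rows whose value is greater than zero."""
    with open(path, 'r') as handle:
        header = next(handle).rstrip('\n').split('\t')
        # Index of the last header starting with the prefix
        # (max() raises ValueError if no such column exists).
        last = max(i for i, name in enumerate(header) if name.startswith(prefix))
        samples = header[first:last + 1]
        counts = [0] * len(samples)
        for line in handle:  # stream the data rows one at a time
            values = line.rstrip('\n').split('\t')[first:last + 1]
            for n, v in enumerate(values):
                if float(v) > 0:  # assumes numeric cells
                    counts[n] += 1
    return dict(zip(samples, counts))

# Tiny made-up file: 11 leading columns, two sample columns, one trailing column.
rows = [
    ['h%d' % k for k in range(11)] + ['sample_a', 'sample_b', 'taxonomy'],
    ['x'] * 11 + ['0', '3', 'Bacteria'],
    ['x'] * 11 + ['2', '5', 'Archaea'],
]
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('\n'.join('\t'.join(r) for r in rows) + '\n')
result = count_otus(tmp.name)
os.remove(tmp.name)
print(result)   # {'sample_a': 1, 'sample_b': 2}
```

Stripping only the newline (rather than calling strip()) keeps empty trailing cells in place, so the column positions stay aligned with the header.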
