How to select the last value/index in a Python list based on the last occurrence matching a specific regular expression?

Question

I'm performing certain calculations on a large .txt file (tab-delimited, 300+ columns, 1,000,000+ rows) using the following code:

samples = []
OTUnumber = []

with open('all.16S.uniq.txt', 'r') as file:
    for i, line in enumerate(file):
        columns = line.strip().split('\t')[11:353]
        if i == 0:  # headers are sample names, so first row
            samples = columns  # save sample names
            OTUnumber = [0 for s in samples]  # set starting value to zero
        else:
            for n, v in enumerate(columns):
                if int(v) > 0:  # fields are strings, so convert before comparing
                    OTUnumber[n] = OTUnumber[n] + 1

result = dict(zip(samples, OTUnumber))

I have a question about one part of this code. Code of interest:

columns = line.strip().split('\t')[11:353] ###row i is split and saved as a list

The .txt file has a lot of columns and I'm only interested in some of them. I frequently generate this kind of .txt file, and the columns of interest always start at index 11 but do not always end at index 353. The last columns are never columns of interest. I want to "automate" this code so that Python runs it on the columns of interest only.

The names of all columns of interest start with "sample". So basically I want to select the last column matching the regular expression "sample". Note that I read a line of the file, split it, and then save it as a list (= columns). Code I'm looking for:

columns = line.strip().split('\t')[11:```LAST COLUMN WHICH STARTS WITH "sample"```]

Based on some research on the web I tried the following code, but it raises a SyntaxError.

columns = line.strip().split('\t') 11:columns.where(columns==^[sample]).last_valid_index()]

Any ideas on how to write this code?

UPDATE:

OTUnumber = []

import re

with open('all.16S.uniq.txt','r') as f_in:
    data = f_in.read()
    for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
        x=g.split('\t') #list containing all sample names

with open('all.16S.uniq.txt','r') as file:
     for i,line in enumerate(file): 
        columns = line.strip().split('\t')[x]
        if i == 0:
            samples = columns2 
            OTUnumber = [0 for s in samples] #
        else:
            for n,v in enumerate(columns):
                if int(v) > 0:
                    OTUnumber[n] = OTUnumber[n] + 1
                else:
                    continue

result = dict(zip(samples,OTUnumber))

returns error: TypeError: list indices must be integers or slices, not list

Answer 1

You could achieve this with a simple regex (with flags set to re.MULTILINE):

import re

data = '''
header 1\theader 2\theader 3\theader 4\theader 5\theader 6\theader 7\theader 8\theader 10\theader 11\theader 12\theader 13\theader 14
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample11\tsample12\tsample13\tc3\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample21\tsample22\tsample23\tsample24\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample31\tsample32\tc3
'''

for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print(g.split('\t'))

Prints:

['sample11', 'sample12', 'sample13']
['sample21', 'sample22', 'sample23', 'sample24']
['sample31', 'sample32']

Edit (to read from file):

import re

with open('all.16S.uniq.txt','r') as f_in:
    data = f_in.read()
    for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
        print(g.split('\t'))

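Edit 2: to get the index of the last column that contains sample (the value printed is the exclusive slice end, i.e. one past that column's 0-based position, so it can be used directly as the upper bound of a slice):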

import re

data = '''
header 1\theader 2\theader 3\theader 4\theader 5\theader 6\theader 7\theader 8\theader 10\theader 11\theader 12\theader 13\theader 14
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample11\tsample12\tsample13\tc3\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample21\tsample22\tsample23\tsample24\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample31\tsample32\tc3
'''

for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print('Index of last column is:', 11 + len(g.split('\t')))

Prints:

Index of last column is: 14
Index of last column is: 15
Index of last column is: 13
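
The TypeError in the question's update comes from slicing with the list of sample names instead of an integer. A minimal sketch of how the computed end index could be plugged back into the counting loop follows. It assumes that only the header row contains the sample names and that the data cells are numeric; the function name count_positive_per_sample and the demo data are hypothetical, not part of the original code.

```python
import io
import re

# Same pattern as above, applied to the header line only.
HEADER_RE = re.compile(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$')

def count_positive_per_sample(lines, first=11):
    """Count, per sample column, how many rows hold a value > 0."""
    lines = iter(lines)
    header = next(lines).rstrip('\n')
    match = HEADER_RE.match(header)
    if match is None:
        raise ValueError('no column containing "sample" after column %d' % first)
    # Exclusive slice end: start index plus the number of captured columns.
    end = first + len(match.group(1).split('\t'))
    samples = header.split('\t')[first:end]
    counts = [0] * len(samples)
    for line in lines:
        values = line.rstrip('\n').split('\t')[first:end]
        for n, v in enumerate(values):
            if float(v) > 0:  # assumption: every sample cell is numeric
                counts[n] += 1
    return dict(zip(samples, counts))

# Hypothetical in-memory file: 11 leading columns, 3 samples, 2 trailing columns.
lead = '\t'.join('c%d' % k for k in range(11))
demo = io.StringIO(
    lead + '\tsample1\tsample2\tsample3\ttaxonomy\tnotes\n'
    + lead + '\t0\t3\t1\tBacteria\t-\n'
    + lead + '\t2\t0\t5\tArchaea\t-\n'
)
result = count_positive_per_sample(demo)
print(result)
```

With a real file, the open file object can be passed in place of the io.StringIO object, which also avoids reading the whole file into memory with f_in.read().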

Answer 2

This is one approach using a custom function.

Example:

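def get_last_sample_index(columns):
    for ind, c in enumerate(reversed(columns), 1):  # walk the columns from the end, counting from 1
        if c.startswith("sample"):                  # first hit from the end is the last `sample` column
            return ind
    return -1

with open('all.16S.uniq.txt','r') as file:
    for i,line in enumerate(file):
        columns = line.strip().split('\t')
        if i == 0:
            # Only the header row holds the sample names, so compute the slice end once.
            # Assumes the header has at least one column starting with "sample".
            end = len(columns) - get_last_sample_index(columns) + 1
        columns = columns[11:end]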
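
For the more general case in the title (finding the last list element that matches a regular expression), the same idea can be sketched with the re module. The helper name last_index_matching and the sample header below are hypothetical and only serve as illustration; re.match anchors the pattern at the start of each item, which corresponds to "starts with sample".

```python
import re

def last_index_matching(items, pattern):
    """Return the 0-based index of the last item matching pattern, or -1."""
    regex = re.compile(pattern)
    # Scan from the end so the first hit is the last occurrence.
    for idx in range(len(items) - 1, -1, -1):
        if regex.match(items[idx]):
            return idx
    return -1

header = ['id', 'sample1', 'sample2', 'taxonomy']
last = last_index_matching(header, r'sample')
selected = header[1:last + 1]  # + 1 because the slice end is exclusive
print(last, selected)
```

Returning a forward 0-based index keeps the slice arithmetic simple: the columns of interest are header[start:last + 1], and a return value of -1 signals that no column matched.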

2 answers

Answer 1: score 1, accepted, 2019-06-25 07:26:23

Answer 2: score 0, 2019-06-25 07:58:09
