How to select the last value/index in a Python list based on the last occurrence of a specific regular expression?

Question

I am using the following code to perform some calculations on a large .txt file (tab-delimited, 300+ columns, 1,000,000+ rows):

samples = []
OTUnumber = []

with open('all.16S.uniq.txt', 'r') as file:
    for i, line in enumerate(file):
        columns = line.strip().split('\t')[11:353]
        if i == 0:  # headers are sample names, so first row
            samples = columns  # save sample names
            OTUnumber = [0 for s in samples]  # set starting value to zero
        else:
            for n, v in enumerate(columns):
                if int(v) > 0:  # fields are read as strings, so convert first
                    OTUnumber[n] = OTUnumber[n] + 1

result = dict(zip(samples, OTUnumber))

I have a question about one part of this code. The code of interest:

columns = line.strip().split('\t')[11:353]  # row i is split and saved as a list

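The .txt file has many columns, and I am only interested in some of them. I generate this kind of .txt file regularly; the columns of interest always start at index 11 but do not always end at index 353. The last column is never a column of interest. I would like to "automate" this code so that Python runs it on the columns of interest only.

The names of all columns of interest start with "sample". So basically I want to select everything up to the last column that matches the regular expression "sample". Note that I read one line of the file, split it, and save it as a list (= columns). I am looking for code like the following: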

columns = line.strip().split('\t')[11:```LAST COLUMN WHICH STARTS WITH "sample"```]

Based on some research online, I tried the following code, but it raises a SyntaxError.

columns = line.strip().split('\t') 11:columns.where(columns==^[sample]).last_valid_index()]

Any ideas on how to write this code?

Update:

OTUnumber = []

import re

with open('all.16S.uniq.txt','r') as f_in:
    data = f_in.read()
    for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
        x=g.split('\t') #list containing all sample names

with open('all.16S.uniq.txt','r') as file:
     for i,line in enumerate(file): 
        columns = line.strip().split('\t')[x]
        if i == 0:
            samples = columns2 
            OTUnumber = [0 for s in samples] #
        else:
            for n,v in enumerate(columns):
                if int(v) > 0:
                    OTUnumber[n] = OTUnumber[n] + 1
                else:
                    continue

result = dict(zip(samples,OTUnumber))

This returns the error: TypeError: list indices must be integers or slices, not list
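The error comes from `[x]`: `x` is a list of column names, while a list can only be indexed with an integer or a slice. A minimal sketch of the difference, assuming the sample names have already been extracted and the columns of interest start at index 11 as described above (the `row` and `names` values are made up for illustration):

```python
# Hypothetical split row: 11 leading columns, 3 sample columns, 1 trailing column
row = ["c%d" % k for k in range(11)] + ["sample1", "sample2", "sample3", "other"]
names = ["sample1", "sample2", "sample3"]  # what the regex step produces

try:
    row[names]  # indexing with a list of strings
except TypeError as exc:
    print(exc)

end = 11 + len(names)   # an integer, usable as an exclusive slice end
columns = row[11:end]   # slice with integers instead of a list
print(columns)
```

The number of extracted names is enough to build the slice, since the columns of interest are contiguous and always start at index 11.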

Answer 1

You can use a simple regular expression (with the flag set to re.MULTILINE):

import re

data = '''
header 1\theader 2\theader 3\theader 4\theader 5\theader 6\theader 7\theader 8\theader 10\theader 11\theader 12\theader 13\theader 14
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample11\tsample12\tsample13\tc3\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample21\tsample22\tsample23\tsample24\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample31\tsample32\tc3
'''

for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print(g.split('\t'))

Prints:

['sample11', 'sample12', 'sample13']
['sample21', 'sample22', 'sample23', 'sample24']
['sample31', 'sample32']

Edit (reading from a file):

import re

with open('all.16S.uniq.txt','r') as f_in:
    data = f_in.read()
    for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
        print(g.split('\t'))
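For a file with over a million rows, `f_in.read()` loads everything into memory. Since the question states that the sample names sit in the header row, one possible variation is to apply the same pattern to the first line only. The sketch below assumes that layout; `sample_names_from_header` and the in-memory `demo` file are hypothetical names used for illustration.

```python
import io
import re

# Same pattern as above; re.M is not needed when matching a single line
PATTERN = re.compile(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$')

def sample_names_from_header(handle):
    """Read only the first line of `handle` and return the sample column names."""
    header = handle.readline().rstrip('\n')
    match = PATTERN.match(header)
    return match.group(1).split('\t') if match else []

# Small in-memory stand-in for the real file
demo = io.StringIO(
    '\t'.join(['h%d' % k for k in range(11)] + ['sample1', 'sample2', 'tail']) + '\n'
    + '\t'.join(['x'] * 11 + ['3', '0', 'y']) + '\n'
)
names = sample_names_from_header(demo)
print(names)
```

After this call the file handle is positioned at the second line, so the remaining rows can be processed one at a time in a normal `for line in handle` loop.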

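Edit 2: getting the index of the last column containing a sample (the value printed is one past that column, so it can be used directly as the end of a slice such as columns[11:index]):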

import re

data = '''
header 1\theader 2\theader 3\theader 4\theader 5\theader 6\theader 7\theader 8\theader 10\theader 11\theader 12\theader 13\theader 14
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample11\tsample12\tsample13\tc3\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample21\tsample22\tsample23\tsample24\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample31\tsample32\tc3
'''

for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print('Index of last column is:', 11 + len(g.split('\t')))

Prints:

Index of last column is: 14
Index of last column is: 15
Index of last column is: 13
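If the line has already been split into a list, the last matching position can also be found by testing each element against a pattern, scanning from the right. This is a sketch rather than part of the answer above; `last_match_index` and the example `header` list are hypothetical.

```python
import re

def last_match_index(items, pattern):
    """Return the 0-based index of the last item matching `pattern`, or -1."""
    regex = re.compile(pattern)
    for index in range(len(items) - 1, -1, -1):  # walk from the last item backwards
        if regex.match(items[index]):            # match() anchors at the start of the item
            return index
    return -1

header = ['id', 'taxon', 'sample_A', 'sample_B', 'sample_C', 'total']
last = last_match_index(header, r'sample')
print(last)                  # 4
print(header[2:last + 1])    # ['sample_A', 'sample_B', 'sample_C']
```

Because slice ends are exclusive, `last + 1` is needed to keep the last matching column.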

Answer 2

Here is one approach using a custom function.

For example:

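def get_last_sample_index(columns):
    # Scan from the right; ind counts from 1 at the last column
    for ind, c in enumerate(reversed(columns), 1):
        if c.startswith("sample"):      # last column starting with `sample`
            return len(columns) - ind   # convert to a 0-based index from the left
    return -1                           # no column starts with `sample`

with open('all.16S.uniq.txt', 'r') as file:
    for i, line in enumerate(file):
        columns = line.strip().split('\t')
        # +1 makes the slice end exclusive, so the last sample column is kept
        columns = columns[11:get_last_sample_index(columns) + 1]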
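In the file described in the question, only the header row contains names starting with "sample", while the data rows hold values. One way to handle that is to compute the slice end once from the header and reuse it for every data row. The sketch below combines this with the counting step from the question; `count_positive`, the `prefix` parameter, and the demo data are hypothetical, and it assumes every value in the sample columns parses as a number.

```python
import io

START = 11  # first column of interest, as stated in the question

def count_positive(handle, start=START, prefix="sample"):
    """Count, per sample column, the rows whose value is greater than zero."""
    samples, counts, end = [], [], None
    for i, line in enumerate(handle):
        fields = line.rstrip('\n').split('\t')
        if i == 0:
            # Exclusive slice end: one past the last header starting with `prefix`
            matches = (k for k, name in enumerate(fields) if name.startswith(prefix))
            end = max(matches, default=start - 1) + 1
            samples = fields[start:end]
            counts = [0] * len(samples)
        else:
            for n, value in enumerate(fields[start:end]):
                if float(value) > 0:  # assumption: values are numeric strings
                    counts[n] += 1
    return dict(zip(samples, counts))

# Small in-memory stand-in for the real tab-separated file
lines = [
    ['m%d' % k for k in range(11)] + ['sampleA', 'sampleB', 'extra'],
    ['x'] * 11 + ['3', '0', 'z'],
    ['x'] * 11 + ['1', '2', 'z'],
    ['x'] * 11 + ['0', '0', 'z'],
]
demo = io.StringIO(''.join('\t'.join(row) + '\n' for row in lines))
result = count_positive(demo)
print(result)
```

With a real file, the same function could be called as `count_positive(open_file)` inside a `with open(...)` block, which reads the rows one at a time.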

Answer 1
1, accepted, 2019-06-25 07:26:23

Answer 2
0, 2019-06-25 07:58:09
