
How to select the last value/index based on the last occurrence of a certain regular expression in a list in Python?

I'm performing certain calculations on a large tab-delimited .txt file (300+ columns, 1,000,000+ rows) using the following code:

samples = []
OTUnumber = []

with open('all.16S.uniq.txt','r') as file:
    for i,line in enumerate(file):
        columns = line.strip().split('\t')[11:353]
        if i == 0: #headers are sample names so first row
            samples = columns #save sample names
            OTUnumber = [0 for s in samples] #set starting value as zero
        else:
            for n,v in enumerate(columns):
                if int(v) > 0: #values are read as strings, so convert before comparing
                    OTUnumber[n] = OTUnumber[n] + 1

result = dict(zip(samples,OTUnumber))

I have a question about one part of this code. Code of interest:

columns = line.strip().split('\t')[11:353] ###row i is split and saved as a list

The .txt file has many columns and I'm only interested in some of them. I frequently generate this kind of .txt file; the columns of interest always start at index 11 but do not always end at index 353. The last columns are never columns of interest. I want to "automate" this code so that Python runs it only on the columns of interest.

The names of all columns of interest start with "sample". So basically I want to select the last column matching the regular expression "sample". Note that I read a line of the file, split it, and then save it as a list (= columns). The code I'm looking for:

columns = line.strip().split('\t')[11:```LAST COLUMN WHICH STARTS WITH "sample"```]

Based on some research on the web I tried the following code, but it returns a SyntaxError.

columns = line.strip().split('\t') 11:columns.where(columns==^[sample]).last_valid_index()]

Any ideas how to write this code?

UPDATE:

OTUnumber = []

import re

with open('all.16S.uniq.txt','r') as f_in:
    data = f_in.read()
    for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
        x=g.split('\t') #list containing all sample names

with open('all.16S.uniq.txt','r') as file:
     for i,line in enumerate(file): 
        columns = line.strip().split('\t')[x]
        if i == 0:
            samples = columns2 
            OTUnumber = [0 for s in samples] #
        else:
            for n,v in enumerate(columns):
                if int(v) > 0:
                    OTUnumber[n] = OTUnumber[n] + 1
                else:
                    continue

result = dict(zip(samples,OTUnumber))

returns error: TypeError: list indices must be integers or slices, not list

You could achieve this with a simple regex (with flags set to re.MULTILINE):

import re

data = '''
header 1\theader 2\theader 3\theader 4\theader 5\theader 6\theader 7\theader 8\theader 10\theader 11\theader 12\theader 13\theader 14
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample11\tsample12\tsample13\tc3\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample21\tsample22\tsample23\tsample24\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample31\tsample32\tc3
'''

for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print(g.split('\t'))

Prints:

['sample11', 'sample12', 'sample13']
['sample21', 'sample22', 'sample23', 'sample24']
['sample31', 'sample32']
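To make the pattern easier to follow, here is the same regex written with re.VERBOSE and one comment per part. This is only a sketch for illustration: the test line and the names `pattern` and `captured` are made up, not taken from the question's file.

```python
import re

# Same pattern as above, spread over several lines with re.X (re.VERBOSE).
pattern = re.compile(r"""
    ^(?:[^\t]+\t){11}    # skip the first 11 tab-terminated fields
    (                    # capture group: the columns of interest
        .*               # greedy, so it extends up to the last match below
        sample[^\t]+     # the last field containing the text sample
    )
    .*$                  # ignore whatever follows
""", flags=re.M | re.X)

# Hypothetical line: 11 leading fields, two sample fields, two trailing fields.
line = "\t".join(["c%d" % i for i in range(1, 12)] + ["sample1", "sample2", "x", "y"])

captured = pattern.search(line).group(1).split("\t")
print(captured)  # ['sample1', 'sample2']
```

Because `.*` is greedy, the group always ends at the last field containing "sample", which is what selects the last column of interest.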

Edit (to read from file):

import re

with open('all.16S.uniq.txt','r') as f_in:
    data = f_in.read()
    for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
        print(g.split('\t'))

Edit2: to get the index of the last column that contains sample (counted from 1, so it can be used directly as the end of a slice):

import re

data = '''
header 1\theader 2\theader 3\theader 4\theader 5\theader 6\theader 7\theader 8\theader 10\theader 11\theader 12\theader 13\theader 14
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample11\tsample12\tsample13\tc3\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample21\tsample22\tsample23\tsample24\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample31\tsample32\tc3
'''

for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print('Index of last column is:', 11 + len(g.split('\t')))

Prints:

Index of last column is: 14
Index of last column is: 15
Index of last column is: 13
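For the counting loop in the question, the slice end has to be a single integer (indexing a list with another list is what raises the TypeError). Since the sample names appear only in the header row, it can be computed once there and reused for every data row. A minimal sketch, assuming a small in-memory stand-in for the real file: `fake_file`, `START` and `header_pattern` are hypothetical names, and `START` would be 11 for the file in the question.

```python
import io
import re

START = 2  # hypothetical; would be 11 for the file in the question

# Hypothetical stand-in for the real file: sample names in the header only.
fake_file = io.StringIO(
    "id\ttaxon\tsampleA\tsampleB\tsampleC\ttotal\n"
    "otu1\tx\t0\t3\t1\t4\n"
    "otu2\ty\t2\t0\t5\t7\n"
)

# Same idea as the pattern above, with the number of skipped fields filled in.
header_pattern = re.compile(r"^(?:[^\t]+\t){%d}(.*sample[^\t]+)" % START)

samples, otu_number, end = [], [], None
for i, line in enumerate(fake_file):
    fields = line.rstrip("\n").split("\t")
    if i == 0:
        # match is None if the header has no sample column (not handled here)
        match = header_pattern.match(line)
        # Integer slice end: START plus the number of captured columns.
        end = START + len(match.group(1).split("\t"))
        samples = fields[START:end]
        otu_number = [0] * len(samples)
    else:
        for n, v in enumerate(fields[START:end]):
            if int(v) > 0:
                otu_number[n] += 1

result = dict(zip(samples, otu_number))
print(result)  # {'sampleA': 1, 'sampleB': 1, 'sampleC': 2}
```

This also reads the file line by line instead of loading it with f_in.read(), which may matter for a file with over a million rows.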

This is one approach, using a custom function:

def get_last_sample_index(columns):
    for ind in range(len(columns) - 1, -1, -1):     #Scan columns from the right
        if columns[ind].startswith("sample"):       #Get last column with `sample`
            return ind
    return -1                                       #No `sample` column found

with open('all.16S.uniq.txt','r') as file:
    for i,line in enumerate(file):
        columns = line.strip().split('\t')
        if i == 0:                                  #Only the header row holds the sample names
            end = get_last_sample_index(columns) + 1
        columns = columns[11:end]
