I'm performing certain calculations on a large tab-delimited .txt file (300+ columns, 1,000,000+ rows) using the following code:
samples = []
OTUnumber = []
with open('all.16S.uniq.txt','r') as file:
    for i,line in enumerate(file):
        columns = line.strip().split('\t')[11:353]
        if i == 0: #headers are sample names so first row
            samples = columns #save sample names
            OTUnumber = [0 for s in samples] #set starting value as zero
        else:
            for n,v in enumerate(columns):
                if int(v) > 0: #values are read as strings, so convert before comparing
                    OTUnumber[n] = OTUnumber[n] + 1
                else:
                    continue
result = dict(zip(samples,OTUnumber))
I have a question about one part of this code, namely this line:
columns = line.strip().split('\t')[11:353] # row i is split and saved as a list
The .txt file has a lot of columns and I'm only interested in some of them. I frequently generate this kind of .txt file, and the columns of interest always start at index 11 but do not always end at index 353. The last columns are never columns of interest. I want to "automate" this code so that Python works out the columns of interest by itself.
The names of all columns of interest start with "sample". So basically I want to select everything up to the last column whose name matches the regular expression "sample". Mind that I read a line of the file, split it, and then save it as a list (= columns). The code I'm looking for:
columns = line.strip().split('\t')[11:<LAST COLUMN WHICH STARTS WITH "sample">]
Based on some research on the web I tried the following code, but it raises a SyntaxError:
columns = line.strip().split('\t') 11:columns.where(columns==^[sample]).last_valid_index()]
Any ideas how to write this code?
UPDATE:
OTUnumber = []
import re
with open('all.16S.uniq.txt','r') as f_in:
    data = f_in.read()
    for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
        x=g.split('\t') #list containing all sample names
with open('all.16S.uniq.txt','r') as file:
    for i,line in enumerate(file):
        columns = line.strip().split('\t')[x]
        if i == 0:
            samples = columns2
            OTUnumber = [0 for s in samples] #
        else:
            for n,v in enumerate(columns):
                if int(v) > 0:
                    OTUnumber[n] = OTUnumber[n] + 1
                else:
                    continue
result = dict(zip(samples,OTUnumber))
This returns the error: TypeError: list indices must be integers or slices, not list
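The TypeError comes from indexing a list with `x`, which is itself a list of column names; a Python list accepts only an integer or a slice as its index. A minimal sketch of one way around this, assuming (as the question states) that the sample names sit in the header row. The `header` and `row` values here are made up for illustration and do not come from the real file:

```python
# Hypothetical header and data row, standing in for split lines of the real file.
header = ["id"] * 11 + ["sample_A", "sample_B", "sample_C", "taxonomy"]
row = ["x"] * 11 + ["0", "5", "2", "Bacteria"]

# Position of the last header name that starts with "sample".
last = max(i for i, name in enumerate(header) if name.startswith("sample"))

# A slice takes integer bounds; its end is exclusive, hence last + 1.
wanted = slice(11, last + 1)

samples = header[wanted]  # names of the columns of interest
values = row[wanted]      # the same slice object can be reused on every row
print(samples)
print(values)
```

Because the slice is built once from the header, every later row can be cut with the same `wanted` object without searching for "sample" again.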
You could achieve this with a simple regex (with flags set to re.MULTILINE):
import re
data = '''
header 1\theader 2\theader 3\theader 4\theader 5\theader 6\theader 7\theader 8\theader 10\theader 11\theader 12\theader 13\theader 14
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample11\tsample12\tsample13\tc3\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample21\tsample22\tsample23\tsample24\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample31\tsample32\tc3
'''
for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print(g.split('\t'))
Prints:
['sample11', 'sample12', 'sample13']
['sample21', 'sample22', 'sample23', 'sample24']
['sample31', 'sample32']
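To make the pattern easier to follow, here is the same regex spelled out with `re.VERBOSE` so that each part can carry a comment. This is only an annotated restatement of the pattern above; the name `SAMPLE_SPAN` and the test line are made up for illustration:

```python
import re

SAMPLE_SPAN = re.compile(
    r"""
    ^(?:[^\t]+\t){11}   # skip the first 11 tab-terminated columns
    (                   # group 1: the part that findall returns
        .*              # greedy, so it runs as far right as it can
        sample[^\t]+    # and therefore ends on the LAST "sample" column
    )
    .*$                 # the remaining columns are matched but discarded
    """,
    flags=re.MULTILINE | re.VERBOSE,
)

# Hypothetical row: 11 leading columns, three sample columns, one trailing column.
line = "\t".join([f"c{i}" for i in range(11)] + ["sample1", "sample2", "sample3", "extra"])
matched = SAMPLE_SPAN.findall(line)
print(matched[0].split("\t"))
```

Note that the pattern looks for the text "sample" inside each line it is applied to. If only the header row of the real file contains that text, only the header row will produce a match.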
Edit (to read from file):
import re
with open('all.16S.uniq.txt','r') as f_in:
    data = f_in.read()
    for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
        print(g.split('\t'))
Edit 2: to get the end index for the slice (one past the last column that contains sample):
import re
data = '''
header 1\theader 2\theader 3\theader 4\theader 5\theader 6\theader 7\theader 8\theader 10\theader 11\theader 12\theader 13\theader 14
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample11\tsample12\tsample13\tc3\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample21\tsample22\tsample23\tsample24\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample31\tsample32\tc3
'''
for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print('Index of last column is:', 11 + len(g.split('\t')))
Prints:
Index of last column is: 14
Index of last column is: 15
Index of last column is: 13
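This is one approach, using a custom function:
def get_last_sample_index(columns):
    # Walk the columns from the end; ind is 1 for the very last column
    for ind, c in enumerate(reversed(columns), 1):
        if c.startswith("sample"): # last column whose name starts with `sample`
            return len(columns) - ind # convert to an index counted from the front
    return -1 # no such column
with open('all.16S.uniq.txt','r') as file:
    for i,line in enumerate(file):
        columns = line.strip().split('\t')
        if i == 0: # the sample names are in the header row only
            last = get_last_sample_index(columns)
        columns = columns[11:last + 1] # slice end is exclusive, hence + 1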
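For completeness, a fuller sketch of the whole task from the question: find the slice end from the header once, then count, per sample, how many rows hold a value greater than zero. It assumes the sample names appear only in the header row and that the sample columns hold integers. The function name `count_positive_per_sample`, its parameters, and the in-memory test data are hypothetical, and `csv.reader` is swapped in for the manual `split('\t')`:

```python
import csv
import io

def count_positive_per_sample(handle, first_col=11, prefix="sample"):
    """Count rows with a value > 0 for each column whose header starts with prefix."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    hits = [i for i, name in enumerate(header) if name.startswith(prefix)]
    if not hits:
        raise ValueError("no header starts with %r" % prefix)
    # Columns of interest: from first_col up to and including the last hit.
    wanted = slice(first_col, hits[-1] + 1)
    samples = header[wanted]
    counts = [0] * len(samples)
    for row in reader:
        for n, value in enumerate(row[wanted]):
            if int(value) > 0:  # assumes integer counts in the sample columns
                counts[n] += 1
    return dict(zip(samples, counts))

# Tiny made-up table: 11 leading columns, two sample columns, one trailing column.
lead = ["meta"] * 11
rows = [
    lead + ["sample1", "sample2", "notes"],
    lead + ["3", "0", "x"],
    lead + ["1", "2", "y"],
]
text = "\n".join("\t".join(r) for r in rows) + "\n"
result = count_positive_per_sample(io.StringIO(text))
print(result)  # {'sample1': 2, 'sample2': 1}
```

With the real file, the handle would come from something like `open('all.16S.uniq.txt', newline='')` in place of the `io.StringIO` object.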