How to select the last value/index based on the last occurrence matching a certain regular expression in a list in Python?
I'm performing certain calculations on a large .txt file (tab-delimited, 300+ columns, 1,000,000+ rows) using the following code:
samples = []
OTUnumber = []
with open('all.16S.uniq.txt', 'r') as file:
    for i, line in enumerate(file):
        columns = line.strip().split('\t')[11:353]
        if i == 0:  # headers are sample names, so this is the first row
            samples = columns  # save sample names
            OTUnumber = [0 for s in samples]  # set starting value to zero
        else:
            for n, v in enumerate(columns):
                if int(v) > 0:  # fields are strings, so convert before comparing
                    OTUnumber[n] = OTUnumber[n] + 1
                else:
                    continue
result = dict(zip(samples, OTUnumber))
I have a question about a certain part of this code. Code of interest:
columns = line.strip().split('\t')[11:353]  # row i is split and saved as a list
The .txt file has a lot of columns and I'm only interested in some of them. I frequently generate this kind of .txt file, and the columns of interest always start at index 11 but do not always end at index 353. The last columns are never columns of interest. I want to "automate" this code so that Python runs it on the columns of interest only.
The names of all columns of interest start with "sample". So basically I want to select the last column matching the regular expression "sample". Note that I read a line of the file, split it, and then save it as a list (= columns). Code I'm looking for:
columns = line.strip().split('\t')[11:```LAST COLUMN WHICH STARTS WITH "sample"```]
Based on some research on the web I tried the following code, but it raises a SyntaxError.
columns = line.strip().split('\t') 11:columns.where(columns==^[sample]).last_valid_index()]
Any ideas how to write this code?
UPDATE:
OTUnumber = []
import re
with open('all.16S.uniq.txt','r') as f_in:
    data = f_in.read()
for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    x=g.split('\t') #list containing all sample names
with open('all.16S.uniq.txt','r') as file:
    for i,line in enumerate(file):
        columns = line.strip().split('\t')[x]
        if i == 0:
            samples = columns2
            OTUnumber = [0 for s in samples]
        else:
            for n,v in enumerate(columns):
                if int(v) > 0:
                    OTUnumber[n] = OTUnumber[n] + 1
                else:
                    continue
result = dict(zip(samples,OTUnumber))
This returns the error: TypeError: list indices must be integers or slices, not list
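The TypeError above arises because a Python list can only be indexed with an integer or a slice, while x is a list of strings. A minimal sketch of one way around it, assuming a made-up header and data row (the names header, row, positions and wanted are hypothetical, not from the original file): collect the integer positions of the matching header names and build a slice from them.

```python
# Hypothetical header and data row, for illustration only.
header = ['id', 'taxon', 'sample1', 'sample2', 'sample3', 'notes']
row = ['7', 'Bacteria', '0', '4', '2', 'n/a']

# Integer positions of the columns whose name starts with "sample".
positions = [i for i, name in enumerate(header) if name.startswith('sample')]

# row[positions] would raise TypeError, so build a slice from the
# first and last position instead (the stop of a slice is exclusive).
wanted = slice(positions[0], positions[-1] + 1)

print(header[wanted])  # ['sample1', 'sample2', 'sample3']
print(row[wanted])     # ['0', '4', '2']
```

This assumes the sample columns are contiguous, as described in the question; if they were not, picking the values one by one with [row[i] for i in positions] would be the safer choice.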
You could achieve this with a simple regex (with flags set to re.MULTILINE):
import re
data = '''
header 1\theader 2\theader 3\theader 4\theader 5\theader 6\theader 7\theader 8\theader 10\theader 11\theader 12\theader 13\theader 14
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample11\tsample12\tsample13\tc3\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample21\tsample22\tsample23\tsample24\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample31\tsample32\tc3
'''
for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print(g.split('\t'))
Prints:
['sample11', 'sample12', 'sample13']
['sample21', 'sample22', 'sample23', 'sample24']
['sample31', 'sample32']
Edit (to read from file):
import re
with open('all.16S.uniq.txt','r') as f_in:
    data = f_in.read()
for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print(g.split('\t'))
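Edit 2: to get the index of the last column that contains "sample" (the value printed is one past that column's zero-based index, so it can be used directly as the end of a slice):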
import re
data = '''
header 1\theader 2\theader 3\theader 4\theader 5\theader 6\theader 7\theader 8\theader 10\theader 11\theader 12\theader 13\theader 14
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample11\tsample12\tsample13\tc3\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample21\tsample22\tsample23\tsample24\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample31\tsample32\tc3
'''
for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print('Index of last column is:', 11 + len(g.split('\t')))
Prints:
Index of last column is: 14
Index of last column is: 15
Index of last column is: 13
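To tie this back to the counting loop in the question, here is a minimal sketch that works out the slice end once, from the header row only, and reuses it for every data row. It is an illustration under assumptions, not a drop-in replacement: the in-memory fake_file, its column names and START = 2 are hypothetical stand-ins (the question's file would use 11), and the header is assumed to contain at least one name starting with "sample".

```python
import io
import re

# Hypothetical in-memory stand-in for the tab-delimited file.
fake_file = io.StringIO(
    "id\ttaxon\tsample1\tsample2\tsample3\tnotes\n"
    "a\tx\t0\t4\t2\tfoo\n"
    "b\ty\t1\t0\t5\tbar\n"
)

START = 2  # first column of interest (11 in the question's file)
samples, counts, stop = [], [], None

for i, line in enumerate(fake_file):
    fields = line.rstrip('\n').split('\t')
    if i == 0:
        # Header row: find every column whose name starts with "sample"
        # (re.match anchors at the start of the string).
        matches = [n for n, name in enumerate(fields) if re.match(r'sample', name)]
        stop = matches[-1] + 1  # exclusive slice end, one past the last match
        samples = fields[START:stop]
        counts = [0] * len(samples)
    else:
        # Data rows: count the values greater than zero per sample column.
        for n, v in enumerate(fields[START:stop]):
            if int(v) > 0:
                counts[n] += 1

result = dict(zip(samples, counts))
print(result)  # {'sample1': 1, 'sample2': 1, 'sample3': 2}
```

Reading the file line by line like this also avoids loading all 1,000,000+ rows into memory at once, which f_in.read() would do.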
This is one approach, using a custom function.
Example:
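def get_last_sample_index(columns):
    for ind, c in enumerate(reversed(columns)):  # walk the columns from the end
        if c.startswith("sample"):  # last column whose name starts with `sample`
            return len(columns) - ind  # exclusive slice end, one past that column
    return 0  # no match: columns[11:0] is an empty list
with open('all.16S.uniq.txt','r') as file:
    for i,line in enumerate(file):
        columns = line.strip().split('\t')
        if i == 0:  # only the header row holds the column names
            end = get_last_sample_index(columns)
        columns = columns[11:end]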
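One thing to watch when a slice end is expressed as a negative offset from the end of the list: an offset of zero does not mean "up to the end", because -0 is 0. The short sketch below, using a hypothetical four-column list in which the last column is a sample, shows the edge case and an absolute stop that avoids it.

```python
# Hypothetical column names; here the last column is itself a sample.
columns = ['a', 'b', 'sample1', 'sample2']

# A negative stop counted from the end works until it reaches zero.
print(columns[2:-1])  # ['sample1']  (drops the last column)
print(columns[2:-0])  # []           (-0 == 0, so the slice is empty)

# An absolute stop (last matching position + 1) has no such edge case.
stop = max(i for i, c in enumerate(columns) if c.startswith('sample')) + 1
print(columns[2:stop])  # ['sample1', 'sample2']
```

In the question's files the last columns are never samples, so the zero-offset case may not occur in practice, but an absolute stop costs nothing extra.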