How to select the last value/index based on last occurrence with a certain regular expression in a list in Python?
I am using the following code to perform some calculations on a large .txt file (tab-delimited, 300+ columns, 1,000,000+ rows):
samples = []
OTUnumber = []
with open('all.16S.uniq.txt','r') as file:
    for i,line in enumerate(file):
        columns = line.strip().split('\t')[11:353]
        if i == 0: #headers are sample names so first row
            samples = columns #save sample names
            OTUnumber = [0 for s in samples] #set starting value as zero
        else:
            for n,v in enumerate(columns):
                if int(v) > 0: #values are read as strings, so convert before comparing
                    OTUnumber[n] = OTUnumber[n] + 1
                else:
                    continue
result = dict(zip(samples,OTUnumber))
I have a question about one part of this code. The line of interest:
columns = line.strip().split('\t')[11:353] # row i is split and saved as a list
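The .txt file has many columns, and I am only interested in some of them. I generate this kind of .txt file regularly. The columns of interest always start at index 11, but they do not always end at index 353. The last column is never a column of interest. I would like to "automate" this code so that Python runs it only on the columns of interest.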
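All columns of interest have names that start with "sample". So basically I want to select everything up to the last column that matches the regular expression "sample". Note that I read one line of the file, split it, and save it as a list (= columns). I am looking for code along these lines: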
columns = line.strip().split('\t')[11:```LAST COLUMN WHICH STARTS WITH "sample"```]
Based on some research on the web I tried the following code, but it raises a SyntaxError.
columns = line.strip().split('\t') 11:columns.where(columns==^[sample]).last_valid_index()]
Any ideas on how to write this code?
Update:
OTUnumber = []
import re
with open('all.16S.uniq.txt','r') as f_in:
    data = f_in.read()
    for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
        x=g.split('\t') #list containing all sample names
with open('all.16S.uniq.txt','r') as file:
    for i,line in enumerate(file):
        columns = line.strip().split('\t')[x]
        if i == 0:
            samples = columns2
            OTUnumber = [0 for s in samples] #
        else:
            for n,v in enumerate(columns):
                if int(v) > 0:
                    OTUnumber[n] = OTUnumber[n] + 1
                else:
                    continue
result = dict(zip(samples,OTUnumber))
This returns the error: TypeError: list indices must be integers or slices, not list
You can use a simple regular expression (with the flag set to re.MULTILINE):
import re
data = '''
header 1\theader 2\theader 3\theader 4\theader 5\theader 6\theader 7\theader 8\theader 10\theader 11\theader 12\theader 13\theader 14
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample11\tsample12\tsample13\tc3\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample21\tsample22\tsample23\tsample24\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample31\tsample32\tc3
'''
for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print(g.split('\t'))
Prints:
['sample11', 'sample12', 'sample13']
['sample21', 'sample22', 'sample23', 'sample24']
['sample31', 'sample32']
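For anyone who finds the pattern hard to read, here is the same regular expression spelled out with re.VERBOSE. This is only an annotated sketch of the pattern above. The name PATTERN and the sample line are made up for illustration.

```python
import re

# Same pattern as above, written with re.VERBOSE so each part can be commented.
PATTERN = re.compile(r"""
    ^                   # start of a line (re.MULTILINE makes ^ match at every line)
    (?:[^\t]+\t){11}    # skip the first 11 tab-terminated fields
    (                   # capture group: the part of the line we want
        .*              # greedy, so the engine backtracks to the LAST 'sample'
        sample[^\t]+    # the last field containing 'sample', up to the next tab
    )
    .*$                 # rest of the line, ignored
""", re.MULTILINE | re.VERBOSE)

# Hypothetical line: 11 leading fields, two sample fields, two trailing fields.
line = "\t".join(["c%d" % i for i in range(11)] + ["sample1", "sample2", "x", "y"])
for g in PATTERN.findall(line):
    print(g.split("\t"))
```

Note that the pattern looks for the text "sample" anywhere after the 11th field, so it relies on no other column containing that word.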
Edit (reading from a file):
import re
with open('all.16S.uniq.txt','r') as f_in:
    data = f_in.read()
    for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
        print(g.split('\t'))
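Since the file in the question has over a million rows, f_in.read() holds the whole file in memory at once. If that becomes a problem, the same pattern can probably be applied one line at a time. A rough sketch, where iter_sample_fields is a hypothetical helper name and the trailing newline is stripped first so it cannot end up inside the last matched field:

```python
import re

# Per-line variant of the pattern above; no re.M needed when matching one line at a time.
LINE_PATTERN = re.compile(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+))')

def iter_sample_fields(lines):
    """Yield the list of sample fields for every line that matches."""
    for line in lines:
        m = LINE_PATTERN.match(line.rstrip('\n'))
        if m:  # lines without a 'sample' field after column 11 are skipped
            yield m.group(1).split('\t')
```

A file object can be passed directly, e.g. iter_sample_fields(f_in) inside the with block, because iterating over a file yields its lines.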
Edit 2: getting the index of the last column that contains a sample:
import re
data = '''
header 1\theader 2\theader 3\theader 4\theader 5\theader 6\theader 7\theader 8\theader 10\theader 11\theader 12\theader 13\theader 14
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample11\tsample12\tsample13\tc3\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample21\tsample22\tsample23\tsample24\tc4
c1\tc2\tc3\tc4\tc5\tc6\tc7\tc8\tc9\tc 10\tc 11\tsample31\tsample32\tc3
'''
for g in re.findall(r'^(?:[^\t]+\t){11}(.*(?:sample[^\t]+)).*$', data, flags=re.M):
    print('Index of last column is:', 11 + len(g.split('\t')))
Prints:
Index of last column is: 14
Index of last column is: 15
Index of last column is: 13
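The TypeError in the update comes from indexing a list with another list: a slice needs integers. One way to tie this back to the counting loop is to work out the end index once from the header row and reuse it for every data row. The sketch below assumes the header names of the columns of interest start with "sample" (as the question states) and that the values are integers. The function name count_positive_per_sample and its parameters are made up for illustration.

```python
def count_positive_per_sample(lines, start=11, prefix="sample"):
    """Count, for each sample column, how many data rows have a value > 0."""
    samples, counts, end = [], [], None
    for i, line in enumerate(lines):
        fields = line.rstrip("\n").split("\t")
        if i == 0:
            # Exclusive slice end: one past the last header that starts with prefix.
            # max() raises ValueError if no header starts with prefix.
            end = max(idx for idx, name in enumerate(fields)
                      if name.startswith(prefix)) + 1
            samples = fields[start:end]
            counts = [0] * len(samples)
        else:
            for n, v in enumerate(fields[start:end]):
                if int(v) > 0:  # assumes integer values in the sample columns
                    counts[n] += 1
    return dict(zip(samples, counts))
```

With the file from the question this would be called as result = count_positive_per_sample(file) inside the with open(...) block.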
Here is one approach using a custom function.
For example:
def get_last_sample_index(columns):
    for ind, c in enumerate(reversed(columns), 1): #Reverse columns
        if c.startswith("sample"): #Get last column with `sample`
            return ind
    return -1
with open('all.16S.uniq.txt','r') as file:
    for i,line in enumerate(file):
        columns = line.strip().split('\t')
        columns = columns[11:-get_last_sample_index(columns)+1]
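One edge case worth knowing about: if the very last column is itself a sample, -get_last_sample_index(columns)+1 evaluates to 0 and the slice comes back empty. The question says the last column is never a column of interest, so this should not come up there. If it might, a variant that returns a forward (exclusive) end index avoids the problem. A sketch, with sample_slice_end as a hypothetical name:

```python
def sample_slice_end(columns, prefix="sample"):
    """Return the exclusive slice end just past the last column that starts
    with prefix, or None if no column does."""
    for idx in range(len(columns) - 1, -1, -1):  # scan from the right
        if columns[idx].startswith(prefix):
            return idx + 1
    return None

# Hypothetical row where the last column is a sample.
row = ["m"] * 11 + ["sample1", "sample2"]
end = sample_slice_end(row)
selected = row[11:end] if end is not None else []
print(selected)
```

The explicit None check matters: row[11:None] would silently return everything from index 11 onward.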