Printing previous line in Python using enumerate
I have a file in the following format:
OperonID GI Synonym Start End Strand Length COG_number Product
1132034 397671780 RVBD_0002 2052 3260 + 402 - DNA polymerase III subunit beta
1132034 397671781 RVBD_0003 3280 4437 + 385 - DNA replication and repair protein RecF
1132034 397671782 RVBD_0004 4434 4997 + 187 - hypothetical protein
1132035 397671783 RVBD_0005 5123 7267 + 714 - DNA gyrase subunit B
1132035 397671784 RVBD_0006 7302 9818 + 838 - DNA gyrase subunit A
1132036 397671786 RVBD_0007Ac 11421 11528 - 35 - hypothetical protein
1132036 397671787 RVBD_0007Bc 11555 11692 - 45 - hypothetical protein
1132037 397671792 RVBD_0012 14089 14877 + 262 - hypothetical protein
I know I could probably use enumerate, and so far I have the following script:
lines = open('operonmap.opr', 'r').read().splitlines()
operon_id = 1132034
start = ''
end = ''
strand = ''
for i,line in enumerate(lines):
    if str(operon_id) in line:
        start += line[28:33]
    else:
        end += line[i-1]
        operonline += start
        operonline += end
        operonline += '\n'
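Then, if this kind of script worked, I would edit the string "operonline" so that it contains only the start, end, and strand information. Unfortunately it does not work, but I hope you can see my logic.
I hope someone can help!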
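For reference, in the script above `line[i-1]` picks a single character out of the current line, while the previous line is `lines[i-1]`. The following is a minimal sketch of the enumerate-based idea with that correction, assuming whitespace-separated fields and that rows of one operon are contiguous; the function name `operon_spans` and the `ROWS` list are made up for illustration.

```python
# Data rows copied from the question (header line omitted).
ROWS = [
    "1132034 397671780 RVBD_0002 2052 3260 + 402 - DNA polymerase III subunit beta",
    "1132034 397671781 RVBD_0003 3280 4437 + 385 - DNA replication and repair protein RecF",
    "1132034 397671782 RVBD_0004 4434 4997 + 187 - hypothetical protein",
    "1132035 397671783 RVBD_0005 5123 7267 + 714 - DNA gyrase subunit B",
    "1132035 397671784 RVBD_0006 7302 9818 + 838 - DNA gyrase subunit A",
    "1132036 397671786 RVBD_0007Ac 11421 11528 - 35 - hypothetical protein",
    "1132036 397671787 RVBD_0007Bc 11555 11692 - 45 - hypothetical protein",
    "1132037 397671792 RVBD_0012 14089 14877 + 262 - hypothetical protein",
]


def operon_spans(rows):
    """Return (operon_id, start, end, strand) for each run of equal operon IDs."""
    spans = []
    first = None  # fields of the first row of the current operon
    for i, line in enumerate(rows):
        fields = line.split()
        if i == 0:
            first = fields
            continue
        prev = rows[i - 1].split()  # the previous LINE, not line[i - 1]
        if prev[0] != fields[0]:
            # The previous row was the last row of its operon.
            spans.append((first[0], int(first[3]), int(prev[4]), first[5]))
            first = fields
    if first is not None:
        # Close the final operon using the last row.
        last = rows[-1].split()
        spans.append((first[0], int(first[3]), int(last[4]), first[5]))
    return spans


for span in operon_spans(ROWS):
    print(*span)
```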
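Here is one possible implementation. parse_file uses the following variables:
    this_info: a dictionary containing the information extracted from the current line
    previous_info: this_info from the previous iteration
    start_info: this_info from the most recent line that started a new operon ID
The desired output is not entirely clear, but you can adapt the main program (at the end) to write the extracted fields in whatever form you choose.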
def parse_file(input_file):
    """
    reads an opr file, returns a list of dictionaries with info about the operon ids
    """
    results = []
    start_info = previous_info = {}
    with open(input_file) as f:
        next(f, None)  # ignore first line (header), tolerate an empty file
        for line in f:
            bits = line.split()
            # dictionary containing information extracted from a
            # particular line
            this_info = {'operon_id': int(bits[0]),
                         'start': int(bits[3]),
                         'end': int(bits[4]),
                         'strand': bits[5]}
            if not previous_info:
                # first line of file
                start_info = this_info
            elif previous_info['operon_id'] != this_info['operon_id']:
                # this is the first line with NEW Operon ID,
                # so add result for previous Operon ID,
                # of which the end line was the PREVIOUS line
                _add_result(results, start_info, previous_info)
                start_info = this_info  # start line for this ID
            # also adding a sanity check here - the strand
            # should be the same for every line of a given
            # operon ID
            if start_info["strand"] != this_info["strand"]:
                print("warning, strand info inconsistent")
            previous_info = this_info  # ready for next iteration
    if previous_info:  # skip if the file had no data lines
        _add_result(results, start_info, previous_info)  # last ID
    return results


def _add_result(results, start_info, end_info):
    """
    add to the results a dictionary based on start line info
    but with end line info used for the 'end' field
    """
    info = start_info.copy()
    info['end'] = end_info['end']
    results.append(info)


for result in parse_file('operonmap.opr'):
    # write out some info
    print(result['operon_id'],
          result['start'],
          result['end'],
          result['strand'])
This gives:
1132034 2052 4997 +
1132035 5123 9818 +
1132036 11421 11692 -
1132037 14089 14877 +
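The same start-of-run / end-of-run bookkeeping can also be delegated to the standard library. This is a sketch using `itertools.groupby`, which groups consecutive items with equal keys, so it assumes (as the sample data suggests) that all rows of an operon are adjacent. The names `spans_by_group` and `ROWS` are hypothetical, and only a subset of the sample rows is used.

```python
from itertools import groupby

# A few data rows from the question (header line omitted).
ROWS = [
    "1132034 397671780 RVBD_0002 2052 3260 + 402 - DNA polymerase III subunit beta",
    "1132034 397671781 RVBD_0003 3280 4437 + 385 - DNA replication and repair protein RecF",
    "1132034 397671782 RVBD_0004 4434 4997 + 187 - hypothetical protein",
    "1132035 397671783 RVBD_0005 5123 7267 + 714 - DNA gyrase subunit B",
    "1132035 397671784 RVBD_0006 7302 9818 + 838 - DNA gyrase subunit A",
    "1132037 397671792 RVBD_0012 14089 14877 + 262 - hypothetical protein",
]


def spans_by_group(rows):
    """Return (operon_id, first start, last end, strand) per consecutive operon ID."""
    parsed = (line.split() for line in rows)
    spans = []
    for operon_id, group in groupby(parsed, key=lambda fields: fields[0]):
        group = list(group)  # materialize so first and last rows are both reachable
        first, last = group[0], group[-1]
        spans.append((int(operon_id), int(first[3]), int(last[4]), first[5]))
    return spans


for span in spans_by_group(ROWS):
    print(*span)
```

If the rows were not already contiguous per operon, they would need to be sorted by operon ID before grouping.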
This is straightforward with pandas, if you want to go that route.
I was able to read your data into a pandas DataFrame and then drop the other columns:
Start End Strand OperonID
0 2052 3260 + 1132034
1 3280 4437 + 1132034
2 4434 4997 + 1132034
3 5123 7267 + 1132035
4 7302 9818 + 1132035
5 11421 11528 - 1132036
6 11555 11692 - 1132036
7 14089 14877 + 1132037
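The loading step itself is not shown in this answer. One way it might be done is sketched below; this is an assumption, not the answerer's code. Because the Product column contains spaces, the sketch splits each line at most eight times instead of relying on a whitespace separator. The helper name `load_operon_table` and the inline `SAMPLE` text (a subset of the rows) are illustrative.

```python
import io

import pandas as pd

SAMPLE = """\
OperonID GI Synonym Start End Strand Length COG_number Product
1132034 397671780 RVBD_0002 2052 3260 + 402 - DNA polymerase III subunit beta
1132034 397671781 RVBD_0003 3280 4437 + 385 - DNA replication and repair protein RecF
1132035 397671783 RVBD_0005 5123 7267 + 714 - DNA gyrase subunit B
1132036 397671786 RVBD_0007Ac 11421 11528 - 35 - hypothetical protein
"""


def load_operon_table(handle):
    """Read the whitespace-separated table and keep four columns."""
    header = next(handle).split()
    # Split at most len(header) - 1 times so multi-word Product values stay whole.
    rows = [line.strip().split(None, len(header) - 1)
            for line in handle if line.strip()]
    df = pd.DataFrame(rows, columns=header)
    for column in ("Start", "End", "OperonID"):
        df[column] = df[column].astype(int)
    return df[["Start", "End", "Strand", "OperonID"]]


df = load_operon_table(io.StringIO(SAMPLE))
print(df)
```

For the real file, `open('operonmap.opr')` would take the place of the `io.StringIO` object.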
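I then grouped by OperonID, stored the Start, End, and Strand values as lists, and created a new column containing the first Start, the last End, and the unique Strand values for each OperonID. You can reorganize it in whatever way suits you: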
df2 = df.groupby('OperonID')[['Start', 'End', 'Strand']].agg(list)
df2['result'] = df2.apply(lambda x: (x['Start'][0], x['End'][-1], set(x['Strand'])), axis=1)
df2['result']:
OperonID
1132034 (2052, 4997, {+})
1132035 (5123, 9818, {+})
1132036 (11421, 11692, {-})
1132037 (14089, 14877, {+})
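As a variation on the grouping above, named aggregation can take the first Start and last End directly, without building intermediate lists. This is a sketch assuming pandas 0.25 or later and rows already in positional order within each operon; the `n_strands` column is an illustrative addition for checking strand consistency.

```python
import pandas as pd

# The four columns kept in the answer, entered by hand for a self-contained example.
df = pd.DataFrame({
    "Start": [2052, 3280, 4434, 5123, 7302, 11421, 11555, 14089],
    "End": [3260, 4437, 4997, 7267, 9818, 11528, 11692, 14877],
    "Strand": ["+", "+", "+", "+", "+", "-", "-", "+"],
    "OperonID": [1132034, 1132034, 1132034, 1132035, 1132035,
                 1132036, 1132036, 1132037],
})

spans = df.groupby("OperonID").agg(
    Start=("Start", "first"),        # first Start within each operon
    End=("End", "last"),             # last End within each operon
    Strand=("Strand", "first"),
    n_strands=("Strand", "nunique"), # should be 1 if the strand is consistent
)
print(spans)
```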
Maybe try logic like this? It just keeps a temporary variable to track the last OperonID you saw, and switches the start/end once it changes:
In [21]: lines = open("test.csv").read().splitlines()
In [22]: lines
Out[22]:
['OperonID,GI,Synonym,Start,End,Strand,Length',
'1132034,397671780,RVBD_0002,2052,3260,+,402',
'1132034,397671781,RVBD_0003,3280,4437,+,385',
'1132034,397671782,RVBD_0004,4434,4997,+,187',
'1132035,397671783,RVBD_0005,5123,7267,+,714',
'1132035,397671784,RVBD_0006,7302,9818,+,838',
'1132036,397671786,RVBD_0007Ac,11421,11528,-,35',
'1132036,397671787,RVBD_0007Bc,11555,11692,-,45',
'1132037,397671792,RVBD_0012,14089,14877,+,262']
In [23]: cur_operonid = ''
In [24]: cur_end = None
In [27]: cur_start = None
    ...: for line in lines[1:]:
    ...:     cols = line.split(',')  # or line.split('\t') for tab-delimited
    ...:     if cur_operonid != cols[0]:  # New OperonID reached
    ...:         if cur_start is not None:
    ...:             print(f"{cur_operonid} went from {cur_start} to {cur_end}")
    ...:         cur_operonid = cols[0]
    ...:         cur_start = cols[3]
    ...:     cur_end = cols[4]  # track the latest End on every row, including the first of an operon
    ...: if cur_start is not None:  # report the final operon as well
    ...:     print(f"{cur_operonid} went from {cur_start} to {cur_end}")
    ...:
1132034 went from 2052 to 4997
1132035 went from 5123 to 9818
1132036 went from 11421 to 11692
1132037 went from 14089 to 14877
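Since this answer works from a comma-separated copy of the data, the same single-pass tracking could also be written with the standard `csv` module, which addresses columns by header name instead of position. This is only a sketch; the generator name `iter_operon_spans` and the inline `SAMPLE_CSV` (a subset of the rows) are assumptions for illustration.

```python
import csv
import io

SAMPLE_CSV = """\
OperonID,GI,Synonym,Start,End,Strand,Length
1132034,397671780,RVBD_0002,2052,3260,+,402
1132034,397671781,RVBD_0003,3280,4437,+,385
1132034,397671782,RVBD_0004,4434,4997,+,187
1132035,397671783,RVBD_0005,5123,7267,+,714
1132035,397671784,RVBD_0006,7302,9818,+,838
1132037,397671792,RVBD_0012,14089,14877,+,262
"""


def iter_operon_spans(handle):
    """Yield (operon_id, start, end) once per run of rows sharing an OperonID."""
    current = None  # [operon_id, start, end] for the run being read
    for row in csv.DictReader(handle):
        if current is not None and row["OperonID"] != current[0]:
            yield tuple(current)  # the previous run is complete
            current = None
        if current is None:
            current = [row["OperonID"], int(row["Start"]), int(row["End"])]
        else:
            current[2] = int(row["End"])  # extend the run to this row's End
    if current is not None:
        yield tuple(current)  # do not forget the final run


spans = list(iter_operon_spans(io.StringIO(SAMPLE_CSV)))
for operon_id, start, end in spans:
    print(f"{operon_id} went from {start} to {end}")
```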
operon_id = '1132034'
start = ''
end = ''
strand = ''
all_data = list()
with open("operonmap.opr", "r") as f:
    lines = [line.split() for line in f.readlines()]
body = lines[1:]  # skip the header line
for line in body:
    # Product can span several words, so it takes the starred slot
    OperonID, GI, Synonym, Start, End, Strand, Length, COG_number, *Product = line
    data = dict()
    data["OperonID"] = OperonID
    data["GI"] = GI
    data["Synonym"] = Synonym
    data["Start"] = Start
    data["End"] = End
    data["Strand"] = Strand
    data["Length"] = Length
    data["COG_number"] = COG_number
    data["Product"] = " ".join(Product)
    all_data.append(data)
for data in all_data:
    if data["OperonID"] == operon_id:
        start, end, strand = data["Start"], data["End"], data["Strand"]
        print("Start\t", start)
        print("End\t", end)
        print("Strand\t", strand)
OUTPUT
Start 2052
End 3260
Strand +
Start 3280
End 4437
Strand +
Start 4434
End 4997
Strand +