Printing previous line in python using enumerate
I have a file in the format below.
OperonID GI Synonym Start End Strand Length COG_number Product
1132034 397671780 RVBD_0002 2052 3260 + 402 - DNA polymerase III subunit beta
1132034 397671781 RVBD_0003 3280 4437 + 385 - DNA replication and repair protein RecF
1132034 397671782 RVBD_0004 4434 4997 + 187 - hypothetical protein
1132035 397671783 RVBD_0005 5123 7267 + 714 - DNA gyrase subunit B
1132035 397671784 RVBD_0006 7302 9818 + 838 - DNA gyrase subunit A
1132036 397671786 RVBD_0007Ac 11421 11528 - 35 - hypothetical protein
1132036 397671787 RVBD_0007Bc 11555 11692 - 45 - hypothetical protein
1132037 397671792 RVBD_0012 14089 14877 + 262 - hypothetical protein
I know I can probably use enumerate, and I have the following script so far.
lines = open('operonmap.opr', 'r').read().splitlines()
operon_id = 1132034
start = ''
end = ''
strand = ''
for i,line in enumerate(lines):
    if str(operon_id) in line:
        start += line[28:33]
    else:
        end += line[i-1]
        operonline += start
        operonline += end
        operonline += '\n'
If this sort of script worked, I would then edit the string 'operonline' to include only the start, end and strand information. Unfortunately it doesn't work, but I hope you can see my logic.
I hope someone's able to help!
Here is a possible implementation. parse_file contains the following variables:
this_info: dictionary containing info relating to the current line
previous_info: this_info from the previous iteration
start_info: this_info from the most recent line that was the start of a new operon ID
The desired output is not exactly clear, but you can adjust the main program (at the end) to write the extracted fields in any form you choose.
def parse_file(input_file):
    """
    reads an opr file, returns a list of dictionaries with info about the operon ids
    """
    results = []
    start_info = previous_info = {}
    with open(input_file) as f:
        next(f)  # ignore first line
        for line in f:
            bits = line.split()

            # dictionary containing information extracted from a
            # particular line
            this_info = {'operon_id': int(bits[0]),
                         'start': int(bits[3]),
                         'end': int(bits[4]),
                         'strand': bits[5]}

            if not previous_info:
                # first line of file
                start_info = this_info
            elif previous_info['operon_id'] != this_info['operon_id']:
                # this is the first line with NEW Operon ID,
                # so add result for previous Operon ID,
                # of which the end line was the PREVIOUS line
                _add_result(results, start_info, previous_info)
                start_info = this_info  # start line for this ID

            # also adding a sanity check here - the strand
            # should be the same for every line of a given
            # operon ID
            if start_info["strand"] != this_info["strand"]:
                print("warning, strand info inconsistent")

            previous_info = this_info  # ready for next iteration

    _add_result(results, start_info, this_info)  # last ID
    return results


def _add_result(results, start_info, end_info):
    """
    add to the results a dictionary based on start line info
    but with end line info used for the 'end' field
    """
    info = start_info.copy()
    info['end'] = end_info['end']
    results.append(info)


for result in parse_file('operonmap.opr'):
    # write out some info
    print(result['operon_id'],
          result['start'],
          result['end'],
          result['strand'])
This gives:
1132034 2052 4997 +
1132035 5123 9818 +
1132036 11421 11692 -
1132037 14089 14877 +
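Since the question asked about enumerate specifically, here is a minimal sketch of the same idea written that way. Note that in the question's script, line[i-1] picks a single character out of the current line; the previous line is lines[i-1]. The function name operon_spans and the inline sample rows are only illustrative assumptions, and the sketch assumes the rows for each operon ID are adjacent, as in the sample.

```python
# Sample rows copied from the question (header plus data lines).
lines = [
    "OperonID GI Synonym Start End Strand Length COG_number Product",
    "1132034 397671780 RVBD_0002 2052 3260 + 402 - DNA polymerase III subunit beta",
    "1132034 397671781 RVBD_0003 3280 4437 + 385 - DNA replication and repair protein RecF",
    "1132034 397671782 RVBD_0004 4434 4997 + 187 - hypothetical protein",
    "1132035 397671783 RVBD_0005 5123 7267 + 714 - DNA gyrase subunit B",
    "1132035 397671784 RVBD_0006 7302 9818 + 838 - DNA gyrase subunit A",
    "1132036 397671786 RVBD_0007Ac 11421 11528 - 35 - hypothetical protein",
    "1132036 397671787 RVBD_0007Bc 11555 11692 - 45 - hypothetical protein",
    "1132037 397671792 RVBD_0012 14089 14877 + 262 - hypothetical protein",
]


def operon_spans(lines):
    """Return one (operon_id, start, end, strand) tuple per operon ID."""
    rows = [line.split() for line in lines[1:]]  # skip the header line
    spans = []
    start = None
    for i, row in enumerate(rows):
        # A new operon begins on the first row or when the ID changes.
        if i == 0 or row[0] != rows[i - 1][0]:
            if i > 0:
                prev = rows[i - 1]  # the previous line closes the old operon
                spans.append((prev[0], start, int(prev[4]), prev[5]))
            start = int(row[3])
    if rows:
        last = rows[-1]  # the final operon has no following line to close it
        spans.append((last[0], start, int(last[4]), last[5]))
    return spans


for span in operon_spans(lines):
    print(*span)
```

Indexing rows[i - 1] is what enumerate buys you here; keeping a previous_info variable, as above, does the same job without needing the whole file in memory.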
This is pretty easy with pandas, if you want to go that route.
I was able to read your data into a pandas DataFrame, then removed the other columns:
Start End Strand OperonID
0 2052 3260 + 1132034
1 3280 4437 + 1132034
2 4434 4997 + 1132034
3 5123 7267 + 1132035
4 7302 9818 + 1132035
5 11421 11528 - 1132036
6 11555 11692 - 1132036
7 14089 14877 + 1132037
Then I grouped by OperonID and stored the Start, End and Strand values as lists, and made a new column with the first Start and last End per OperonID and the unique Strand values. You could reorganize this any way you see fit.
df2 = df.groupby('OperonID')[['Start', 'End', 'Strand']].agg(list)
df2['result'] = df2.apply(lambda x: (x['Start'][0], x['End'][-1], set(x['Strand'])), axis=1)
df2['result']:
OperonID
1132034 (2052, 4997, {+})
1132035 (5123, 9818, {+})
1132036 (11421, 11692, {-})
1132037 (14089, 14877, {+})
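If you would rather not depend on pandas, the same group-then-take-first-and-last idea can be sketched with itertools.groupby from the standard library. This is only a rough equivalent, not the pandas code above: groupby merges consecutive items only, so it assumes the rows are already ordered by OperonID, and the summarize function and the rows list are hypothetical names for illustration.

```python
from itertools import groupby

# (OperonID, Start, End, Strand) tuples taken from the question's sample data.
rows = [
    ("1132034", 2052, 3260, "+"),
    ("1132034", 3280, 4437, "+"),
    ("1132034", 4434, 4997, "+"),
    ("1132035", 5123, 7267, "+"),
    ("1132035", 7302, 9818, "+"),
    ("1132036", 11421, 11528, "-"),
    ("1132036", 11555, 11692, "-"),
    ("1132037", 14089, 14877, "+"),
]


def summarize(rows):
    """Map each OperonID to (first Start, last End, set of Strand values)."""
    summary = {}
    for operon_id, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)  # materialize so we can index both ends
        summary[operon_id] = (
            group[0][1],              # Start of the first row in the group
            group[-1][2],             # End of the last row in the group
            {r[3] for r in group},    # unique Strand values
        )
    return summary


for operon_id, result in summarize(rows).items():
    print(operon_id, result)
```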
Maybe try something like this logic? It just has a temp variable keeping track of the last OperonID you've seen, and switches the start/end once that changes:
In [21]: lines = open("test.csv").read().splitlines()
In [22]: lines
Out[22]:
['OperonID,GI,Synonym,Start,End,Strand,Length',
'1132034,397671780,RVBD_0002,2052,3260,+,402',
'1132034,397671781,RVBD_0003,3280,4437,+,385',
'1132034,397671782,RVBD_0004,4434,4997,+,187',
'1132035,397671783,RVBD_0005,5123,7267,+,714',
'1132035,397671784,RVBD_0006,7302,9818,+,838',
'1132036,397671786,RVBD_0007Ac,11421,11528,-,35',
'1132036,397671787,RVBD_0007Bc,11555,11692,-,45',
'1132037,397671792,RVBD_0012,14089,14877,+,262']
In [23]: cur_operonid = ''
In [24]: cur_end = None
In [27]: cur_start = None
    ...: for line in lines[1:]:
    ...:     cols = line.split(',')  # or line.split('\t') for tab-delimited
    ...:     if cur_operonid != cols[0]:  # New OperonID reached
    ...:         if cur_start is not None:
    ...:             print(f"{cur_operonid} went from {cur_start} to {cur_end}")
    ...:         cur_operonid = cols[0]
    ...:         cur_start = cols[3]
    ...:     else:
    ...:         cur_end = cols[4]
    ...:
1132034 went from 2052 to 4997
1132035 went from 5123 to 9818
1132036 went from 11421 to 11692
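The session output stops at 1132036 because an operon is only printed when the next one begins, so the final one never gets reported. Also, cur_end is only updated in the else branch, so an operon with a single line (like 1132037 here) would carry over the previous operon's end. A possible variant that handles both cases is sketched below; the generator name operon_ranges and its sep parameter are assumptions for illustration.

```python
def operon_ranges(lines, sep=","):
    """Yield (operon_id, start, end) for each run of lines sharing an OperonID."""
    cur_operonid = cur_start = cur_end = None
    for line in lines[1:]:  # skip the header line
        cols = line.split(sep)
        if cols[0] != cur_operonid:  # new OperonID reached
            if cur_operonid is not None:
                yield cur_operonid, cur_start, cur_end
            cur_operonid, cur_start = cols[0], cols[3]
        cur_end = cols[4]  # updated on every line, so one-line operons work
    if cur_operonid is not None:
        yield cur_operonid, cur_start, cur_end  # flush the final operon


lines = [
    "OperonID,GI,Synonym,Start,End,Strand,Length",
    "1132034,397671780,RVBD_0002,2052,3260,+,402",
    "1132034,397671781,RVBD_0003,3280,4437,+,385",
    "1132034,397671782,RVBD_0004,4434,4997,+,187",
    "1132035,397671783,RVBD_0005,5123,7267,+,714",
    "1132035,397671784,RVBD_0006,7302,9818,+,838",
    "1132036,397671786,RVBD_0007Ac,11421,11528,-,35",
    "1132036,397671787,RVBD_0007Bc,11555,11692,-,45",
    "1132037,397671792,RVBD_0012,14089,14877,+,262",
]

for operon_id, start, end in operon_ranges(lines):
    print(f"{operon_id} went from {start} to {end}")
```

With the sample data this should also report 1132037 going from 14089 to 14877.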
operon_id = '1132034'
start = ''
end = ''
strand = ''
all_data = list()

with open("operonmap.opr", "r") as f:
    lines = [line.split() for line in f.readlines()]

body = lines[1:]  # skip the header line
for line in body:
    # Product can contain spaces, so it collects all the remaining words
    OperonID, GI, Synonym, Start, End, Strand, Length, COG_number, *Product = line
    data = dict()
    data["OperonID"] = OperonID
    data["GI"] = GI
    data["Synonym"] = Synonym
    data["Start"] = Start
    data["End"] = End
    data["Strand"] = Strand
    data["Length"] = Length
    data["COG_number"] = COG_number
    data["Product"] = " ".join(Product)
    all_data.append(data)

# print the fields of every line that belongs to the chosen operon
for data in all_data:
    if data["OperonID"] == operon_id:
        start, end, strand = data["Start"], data["End"], data["Strand"]
        print("Start\t", start)
        print("End\t", end)
        print("Strand\t", strand)
OUTPUT
Start 2052
End 3260
Strand +
Start 3280
End 4437
Strand +
Start 4434
End 4997
Strand +
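One thing to watch when splitting these lines: the Product column contains spaces, so a plain split() breaks it into several words. A small sketch of one way around that is to cap the number of splits with the maxsplit argument of str.split. The helper name parse_row and the FIELDS list are hypothetical, and the sketch assumes every data line has all nine columns as in the sample.

```python
FIELDS = ["OperonID", "GI", "Synonym", "Start", "End",
          "Strand", "Length", "COG_number", "Product"]


def parse_row(line):
    """Split one data line into a dict, keeping Product as a single string."""
    # strip() drops the trailing newline; maxsplit=8 stops after eight
    # splits, so everything after COG_number stays together as Product.
    return dict(zip(FIELDS, line.strip().split(maxsplit=8)))


row = parse_row(
    "1132034 397671780 RVBD_0002 2052 3260 + 402 - DNA polymerase III subunit beta\n"
)
print(row["Start"], row["End"], row["Strand"])
print(row["Product"])
```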