
Printing previous line in python using enumerate

I have a file in the following format.

OperonID    GI      Synonym    Start    End  Strand Length  COG_number  Product
1132034 397671780   RVBD_0002   2052    3260    +   402 -   DNA polymerase III subunit beta
1132034 397671781   RVBD_0003   3280    4437    +   385 -   DNA replication and repair protein RecF
1132034 397671782   RVBD_0004   4434    4997    +   187 -   hypothetical protein
1132035 397671783   RVBD_0005   5123    7267    +   714 -   DNA gyrase subunit B
1132035 397671784   RVBD_0006   7302    9818    +   838 -   DNA gyrase subunit A
1132036 397671786   RVBD_0007Ac 11421   11528   -   35  -   hypothetical protein
1132036 397671787   RVBD_0007Bc 11555   11692   -   45  -   hypothetical protein
1132037 397671792   RVBD_0012   14089   14877   +   262 -   hypothetical protein
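I need the start and end coordinates for each Operon ID, along with the strand, in its own file/string. For example, for operon 1132034 the start coordinate is 2052, the end coordinate is 4997, and the strand is +.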

I know I could probably use enumerate, and so far I have the following script.

lines = open('operonmap.opr', 'r').read().splitlines()
operon_id = 1132034
start = ''
end = ''
strand = ''

for i,line in enumerate(lines):
      if str(operon_id) in line:
            start += line[28:33]
      else:
            end += line[i-1]
            operonline += start
            operonline += end
            operonline += '\n'

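If this kind of script worked, I would then edit the string "operonline" so that it contains only the start, end and strand information. Unfortunately it doesn't work, but I hope you can see my logic.

I hope someone is able to help!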

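Here is a possible implementation. parse_file contains the following variables:

  • this_info: a dictionary containing the information relating to the current line

  • previous_info: this_info from the previous iteration

  • start_info: this_info from the most recent line that started a new Operon ID

The desired output isn't entirely clear, but adjust the main program (at the end) to write the extracted fields in whatever form you choose.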

def parse_file(input_file):
    """
    reads an opr file, returns a list of dictionaries with info about the operon ids
    """
    results = []
    start_info = previous_info = {}
    with open(input_file) as f:
        next(f)  # ignore first line
        for line in f:
            bits = line.split()

            # dictionary containing information extracted from a
            # particular line
            this_info = {'operon_id': int(bits[0]),
                         'start': int(bits[3]),
                         'end': int(bits[4]),
                         'strand': bits[5]}

            if not previous_info:
                # first line of file
                start_info = this_info

            elif previous_info['operon_id'] != this_info['operon_id']:
                # this is the first line with NEW Operon ID,
                # so add result for previous Operon ID,  
                # of which the end line was the PREVIOUS line
                _add_result(results, start_info, previous_info)
                start_info = this_info  # start line for this ID

            # also adding a sanity check here - the strand
            # should be the same for every line of a given
            # operon ID
            if start_info["strand"] != this_info["strand"]:
                print("warning, strand info inconsistent")

            previous_info = this_info  # ready for next iteration

        if previous_info:  # skip if the file had no data rows
            _add_result(results, start_info, previous_info)  # last ID

    return results


def _add_result(results, start_info, end_info):
    """
    add to the results a dictionary based on start line info
    but with end line info used for the 'end' field
    """
    info = start_info.copy()
    info['end'] = end_info['end']
    results.append(info)


for result in parse_file('operonmap.opr'):
    # write out some info
    print(result['operon_id'],
          result['start'],
          result['end'],
          result['strand'])

This gives:

1132034 2052 4997 +
1132035 5123 9818 +
1132036 11421 11692 -
1132037 14089 14877 +
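
For comparison with the enumerate idea in the question, here is a minimal sketch of the same previous-line logic that looks back with rows[i - 1]. It is only an illustration: it assumes whitespace-separated columns and adjacent rows per operon ID as in the sample, it reads from an in-memory copy of the sample rather than the file, and the name operon_ranges is hypothetical.

```python
# Sample rows copied from the question; in practice these would come from the file.
SAMPLE = """\
OperonID    GI      Synonym    Start    End  Strand Length  COG_number  Product
1132034 397671780   RVBD_0002   2052    3260    +   402 -   DNA polymerase III subunit beta
1132034 397671781   RVBD_0003   3280    4437    +   385 -   DNA replication and repair protein RecF
1132034 397671782   RVBD_0004   4434    4997    +   187 -   hypothetical protein
1132035 397671783   RVBD_0005   5123    7267    +   714 -   DNA gyrase subunit B
1132035 397671784   RVBD_0006   7302    9818    +   838 -   DNA gyrase subunit A
1132036 397671786   RVBD_0007Ac 11421   11528   -   35  -   hypothetical protein
1132036 397671787   RVBD_0007Bc 11555   11692   -   45  -   hypothetical protein
1132037 397671792   RVBD_0012   14089   14877   +   262 -   hypothetical protein
"""


def operon_ranges(lines):
    """Return (operon_id, start, end, strand) tuples, one per operon ID.

    Assumes the first line is a header and that rows sharing an
    operon ID are adjacent (hypothetical helper, not from the answer).
    """
    rows = [line.split() for line in lines[1:] if line.strip()]
    results = []
    start = None
    for i, row in enumerate(rows):
        if i == 0 or row[0] != rows[i - 1][0]:
            # A new operon ID begins here, so the previous row (if any)
            # was the last row of the previous operon.
            if i > 0:
                prev = rows[i - 1]
                results.append((prev[0], start, int(prev[4]), prev[5]))
            start = int(row[3])
    if rows:
        # The final operon has no following row to trigger the check above.
        last = rows[-1]
        results.append((last[0], start, int(last[4]), last[5]))
    return results


for record in operon_ranges(SAMPLE.splitlines()):
    print(*record)
```

To run it on the real file, the list passed in could be open('operonmap.opr').read().splitlines(), as in the question.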

This is easy with pandas, if you want to go that route.

I was able to read your data into a pandas DataFrame and then drop the other columns:

   Start    End Strand OperonID
0   2052   3260      +  1132034
1   3280   4437      +  1132034
2   4434   4997      +  1132034
3   5123   7267      +  1132035
4   7302   9818      +  1132035
5  11421  11528      -  1132036
6  11555  11692      -  1132036
7  14089  14877      +  1132037

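I then grouped by OperonID, stored the Start, End and Strand values as lists, and created a new column containing the first Start, the last End and the unique Strand values for each OperonID. You can reorganize it in any way you see fit.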

df2 = df.groupby('OperonID')[['Start', 'End', 'Strand']].agg(list)
df2['result'] = df2.apply(lambda x: (x['Start'][0], x['End'][-1], set(x['Strand'])), axis=1)

df2['result']:

OperonID
1132034      (2052, 4997, {+})
1132035      (5123, 9818, {+})
1132036    (11421, 11692, {-})
1132037    (14089, 14877, {+})
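
The answer does not show how the DataFrame was built, so the following is only a sketch under assumptions: it parses an in-memory copy of the sample by splitting on whitespace and keeping the first six fields (the Product column contains spaces), then uses pandas named aggregation with 'first' and 'last' instead of lists. The variable names SAMPLE and summary are hypothetical.

```python
import pandas as pd

# Sample rows copied from the question; in practice these would come from the file.
SAMPLE = """\
OperonID    GI      Synonym    Start    End  Strand Length  COG_number  Product
1132034 397671780   RVBD_0002   2052    3260    +   402 -   DNA polymerase III subunit beta
1132034 397671781   RVBD_0003   3280    4437    +   385 -   DNA replication and repair protein RecF
1132034 397671782   RVBD_0004   4434    4997    +   187 -   hypothetical protein
1132035 397671783   RVBD_0005   5123    7267    +   714 -   DNA gyrase subunit B
1132035 397671784   RVBD_0006   7302    9818    +   838 -   DNA gyrase subunit A
1132036 397671786   RVBD_0007Ac 11421   11528   -   35  -   hypothetical protein
1132036 397671787   RVBD_0007Bc 11555   11692   -   45  -   hypothetical protein
1132037 397671792   RVBD_0012   14089   14877   +   262 -   hypothetical protein
"""

# Keep only the first six whitespace-separated fields of each data row.
rows = [line.split()[:6] for line in SAMPLE.splitlines()[1:] if line.strip()]
df = pd.DataFrame(
    rows, columns=["OperonID", "GI", "Synonym", "Start", "End", "Strand"]
)
df = df[["OperonID", "Start", "End", "Strand"]].astype(
    {"OperonID": int, "Start": int, "End": int}
)

# One row per operon: first Start, last End, and the first Strand seen.
summary = df.groupby("OperonID", sort=False).agg(
    Start=("Start", "first"),
    End=("End", "last"),
    Strand=("Strand", "first"),
)
print(summary)
```

Taking the first Strand assumes the strand is constant within an operon; the set-based version above makes any inconsistency visible instead.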

Maybe try logic like this? It just keeps a temporary variable to track the last OperonID you saw, and switches the start/end once it changes:

In [21]: lines = open("test.csv").read().splitlines()

In [22]: lines
Out[22]:
['OperonID,GI,Synonym,Start,End,Strand,Length',
 '1132034,397671780,RVBD_0002,2052,3260,+,402',
 '1132034,397671781,RVBD_0003,3280,4437,+,385',
 '1132034,397671782,RVBD_0004,4434,4997,+,187',
 '1132035,397671783,RVBD_0005,5123,7267,+,714',
 '1132035,397671784,RVBD_0006,7302,9818,+,838',
 '1132036,397671786,RVBD_0007Ac,11421,11528,-,35',
 '1132036,397671787,RVBD_0007Bc,11555,11692,-,45',
 '1132037,397671792,RVBD_0012,14089,14877,+,262']

In [23]: cur_operonid = ''

In [24]: cur_end = None
In [27]: cur_start = None
    ...: for line in lines[1:]:
    ...:     cols = line.split(',')  # or line.split('\t') for tab-delimited
    ...:     if cur_operonid != cols[0]:  # New OperonID reached
    ...:         if cur_start is not None:
    ...:             print(f"{cur_operonid} went from {cur_start} to {cur_end}")
    ...:         cur_operonid = cols[0]
    ...:         cur_start = cols[3]
    ...:     else:
    ...:         cur_end = cols[4]
    ...:
1132034 went from 2052 to 4997
1132035 went from 5123 to 9818
1132036 went from 11421 to 11692
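
Note that the loop above prints an operon only when the next ID begins, so the last operon (1132037) is never reported, and cur_end is not updated on the first row of an operon. A minimal sketch of one way to cover both cases follows; it assumes comma-separated lines as in test.csv, and the name collect_ranges is hypothetical.

```python
def collect_ranges(lines):
    """Return (operon_id, start, end) tuples from comma-separated lines.

    The first line is treated as a header, and rows with the same
    OperonID are assumed to be adjacent.
    """
    ranges = []
    cur_operonid = None
    cur_start = cur_end = None
    for line in lines[1:]:
        cols = line.split(",")
        if cols[0] != cur_operonid:
            # New OperonID reached: store the one that just finished.
            if cur_operonid is not None:
                ranges.append((cur_operonid, cur_start, cur_end))
            cur_operonid = cols[0]
            cur_start = cols[3]
        # Update the end on every row, so single-row operons work too.
        cur_end = cols[4]
    if cur_operonid is not None:
        # Flush the final operon, which has no following row.
        ranges.append((cur_operonid, cur_start, cur_end))
    return ranges


lines = [
    "OperonID,GI,Synonym,Start,End,Strand,Length",
    "1132034,397671780,RVBD_0002,2052,3260,+,402",
    "1132034,397671781,RVBD_0003,3280,4437,+,385",
    "1132034,397671782,RVBD_0004,4434,4997,+,187",
    "1132035,397671783,RVBD_0005,5123,7267,+,714",
    "1132035,397671784,RVBD_0006,7302,9818,+,838",
    "1132036,397671786,RVBD_0007Ac,11421,11528,-,35",
    "1132036,397671787,RVBD_0007Bc,11555,11692,-,45",
    "1132037,397671792,RVBD_0012,14089,14877,+,262",
]

for operon_id, start, end in collect_ranges(lines):
    print(f"{operon_id} went from {start} to {end}")
```
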

operon_id = '1132034'
start = ''
end = ''
strand = ''

all_data = list()

with open("operonmap.opr", "r") as f:
    lines = [line.split() for line in f.readlines()]
    body = lines[1:]
    for line in body:
        # Product may contain spaces, so it collects all remaining fields
        OperonID, GI, Synonym, Start, End, Strand, Length, COG_number, *Product = line
        Product = " ".join(Product)
        data = dict()
        data["OperonID"] = OperonID
        data["GI"] = GI
        data["Synonym"] = Synonym
        data["Start"] = Start
        data["End"] = End
        data["Strand"] = Strand
        data["Length"] = Length
        data["COG_number"] = COG_number
        data["Product"] = Product
        all_data.append(data)

for data in all_data:
    if data["OperonID"] == operon_id:
        start, end, strand = data["Start"], data["End"], data["Strand"]
        print("Start\t", start)
        print("End\t", end)
        print("Strand\t", strand)

OUTPUT

Start    2052
End      3260
Strand   +
Start    3280
End      4437
Strand   +
Start    4434
End      4997
Strand   +
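
The output above lists every row of the selected operon rather than a single range. As a sketch of how such rows might be collapsed into one record per operon, the following uses itertools.groupby on a list of dictionaries shaped like all_data. It assumes rows for each OperonID are adjacent; the small all_data literal is copied from the sample, and the name summarize is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

# A reduced stand-in for all_data, using values from the sample file.
all_data = [
    {"OperonID": "1132034", "Start": "2052", "End": "3260", "Strand": "+"},
    {"OperonID": "1132034", "Start": "3280", "End": "4437", "Strand": "+"},
    {"OperonID": "1132034", "Start": "4434", "End": "4997", "Strand": "+"},
    {"OperonID": "1132035", "Start": "5123", "End": "7267", "Strand": "+"},
    {"OperonID": "1132035", "Start": "7302", "End": "9818", "Strand": "+"},
]


def summarize(rows):
    """Collapse adjacent rows with the same OperonID into one record each."""
    summaries = []
    for operon_id, group in groupby(rows, key=itemgetter("OperonID")):
        group = list(group)
        summaries.append({
            "OperonID": operon_id,
            "Start": group[0]["Start"],    # start of the first row
            "End": group[-1]["End"],       # end of the last row
            "Strand": group[0]["Strand"],  # assumed constant per operon
        })
    return summaries


for s in summarize(all_data):
    print(s["OperonID"], s["Start"], s["End"], s["Strand"])
```

Because groupby only merges consecutive items, unsorted input would need to be sorted by OperonID first.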
