Printing previous line in python using enumerate

I have a file in the below format.

OperonID    GI      Synonym    Start    End  Strand Length  COG_number  Product
1132034 397671780   RVBD_0002   2052    3260    +   402 -   DNA polymerase III subunit beta
1132034 397671781   RVBD_0003   3280    4437    +   385 -   DNA replication and repair protein RecF
1132034 397671782   RVBD_0004   4434    4997    +   187 -   hypothetical protein
1132035 397671783   RVBD_0005   5123    7267    +   714 -   DNA gyrase subunit B
1132035 397671784   RVBD_0006   7302    9818    +   838 -   DNA gyrase subunit A
1132036 397671786   RVBD_0007Ac 11421   11528   -   35  -   hypothetical protein
1132036 397671787   RVBD_0007Bc 11555   11692   -   45  -   hypothetical protein
1132037 397671792   RVBD_0012   14089   14877   +   262 -   hypothetical protein
  • I need the start and end co-ordinates of each Operon ID, plus the strand, in its own file/string. E.g. for the operon 1132034 the start co-ordinate is 2052 and the end co-ordinate is 4997, and the strand is +.

I know I can probably use enumerate, and I have the following script so far.

lines = open('operonmap.opr', 'r').read().splitlines()
operon_id = 1132034
start = ''
end = ''
strand = ''

for i,line in enumerate(lines):
      if str(operon_id) in line:
            start += line[28:33]
      else:
            end += line[i-1]
            operonline += start
            operonline += end
            operonline += '\n'

If this sort of script worked, I would then edit the string 'operonline' to include only the start, end and strand information. Unfortunately it doesn't work, but I hope you can see my logic.

I hope someone's able to help!

Here is a possible implementation. parse_file contains the following variables:

  • this_info : dictionary containing info relating to the current line

  • previous_info : this_info from the previous iteration

  • start_info : this_info from the most recent line that was the start of a new operon ID

The desired output is not exactly clear, but you can adjust the main program (at the end) to write the extracted fields in any form you choose.

def parse_file(input_file):
    """
    reads an opr file, returns a list of dictionaries with info about the operon ids
    """
    results = []
    start_info = previous_info = {}
    with open(input_file) as f:
        next(f)  # ignore first line
        for line in f:
            bits = line.split()

            # dictionary containing information extracted from a
            # particular line
            this_info = {'operon_id': int(bits[0]),
                         'start': int(bits[3]),
                         'end': int(bits[4]),
                         'strand': bits[5]}

            if not previous_info:
                # first line of file
                start_info = this_info

            elif previous_info['operon_id'] != this_info['operon_id']:
                # this is the first line with NEW Operon ID,
                # so add result for previous Operon ID,  
                # of which the end line was the PREVIOUS line
                _add_result(results, start_info, previous_info)
                start_info = this_info  # start line for this ID

            # also adding a sanity check here - the strand
            # should be the same for every line of a given
            # operon ID
            if start_info["strand"] != this_info["strand"]:
                print("warning, strand info inconsistent")

            previous_info = this_info  # ready for next iteration

        if previous_info:  # guard against a file with no data rows
            _add_result(results, start_info, previous_info)  # last ID

    return results


def _add_result(results, start_info, end_info):
    """
    add to the results a dictionary based on start line info
    but with end line info used for the 'end' field
    """
    info = start_info.copy()
    info['end'] = end_info['end']
    results.append(info)


for result in parse_file('operonmap.opr'):
    # write out some info
    print(result['operon_id'],
          result['start'],
          result['end'],
          result['strand'])

This gives:

1132034 2052 4997 +
1132035 5123 9818 +
1132036 11421 11692 -
1132037 14089 14877 +
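
Since the question asked about enumerate specifically, the same "look at the previous line" idea can also be sketched with an index into a list of rows. This is only an illustration: the names operon_ranges and _summarise are made up for this sketch, and it assumes whitespace-separated rows with the header already removed and the columns in the order shown in the question.

```python
def operon_ranges(lines):
    """Return one (operon_id, start, end, strand) tuple per run of equal IDs."""
    # Split every non-blank data row into whitespace-separated fields.
    rows = [line.split() for line in lines if line.strip()]
    results = []
    first = 0  # index of the row that opened the current operon

    for i, row in enumerate(rows):
        # A change of ID means the previous row (i - 1) closed the old operon.
        if i > 0 and row[0] != rows[i - 1][0]:
            results.append(_summarise(rows[first], rows[i - 1]))
            first = i

    if rows:  # the final operon is closed by the last row
        results.append(_summarise(rows[first], rows[-1]))
    return results


def _summarise(first_row, last_row):
    # Assumed columns: 0 = OperonID, 3 = Start, 4 = End, 5 = Strand
    return (first_row[0], int(first_row[3]), int(last_row[4]), first_row[5])


# A few data rows copied from the question (header omitted).
sample = [
    "1132034 397671780 RVBD_0002 2052 3260 + 402 - DNA polymerase III subunit beta",
    "1132034 397671782 RVBD_0004 4434 4997 + 187 - hypothetical protein",
    "1132035 397671783 RVBD_0005 5123 7267 + 714 - DNA gyrase subunit B",
    "1132036 397671786 RVBD_0007Ac 11421 11528 - 35 - hypothetical protein",
    "1132036 397671787 RVBD_0007Bc 11555 11692 - 45 - hypothetical protein",
]

for operon in operon_ranges(sample):
    print(*operon)
```

Holding all rows in a list is what makes rows[i - 1] available; for a very large file, the streaming version above, which only remembers previous_info, is probably the better fit.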

This is pretty easy if you use pandas, if you want to go that route.

I was able to read your data into a pandas DataFrame, then removed the other columns:

   Start    End Strand OperonID
0   2052   3260      +  1132034
1   3280   4437      +  1132034
2   4434   4997      +  1132034
3   5123   7267      +  1132035
4   7302   9818      +  1132035
5  11421  11528      -  1132036
6  11555  11692      -  1132036
7  14089  14877      +  1132037
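
The loading step itself isn't shown here. A minimal sketch of one way to get a frame like this is below; it assumes the real file is tab-delimited (the question doesn't say), and it uses a small inline sample in place of operonmap.opr.

```python
import io

import pandas as pd

# Inline stand-in for the file; tab-delimited is an assumption.
sample = (
    "OperonID\tGI\tSynonym\tStart\tEnd\tStrand\tLength\tCOG_number\tProduct\n"
    "1132034\t397671780\tRVBD_0002\t2052\t3260\t+\t402\t-\tDNA polymerase III subunit beta\n"
    "1132034\t397671782\tRVBD_0004\t4434\t4997\t+\t187\t-\thypothetical protein\n"
    "1132036\t397671786\tRVBD_0007Ac\t11421\t11528\t-\t35\t-\thypothetical protein\n"
)

wanted = ["Start", "End", "Strand", "OperonID"]

# usecols drops the other columns while reading; the trailing [wanted]
# puts the remaining columns in the order shown above.
df = pd.read_csv(io.StringIO(sample), sep="\t", usecols=wanted)[wanted]
print(df)
```

For the real file, io.StringIO(sample) would be replaced by the file path.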

Then I grouped by OperonID and stored the Start, End and Strand values as lists, and made a new column with the first Start and last End per OperonID and the unique Strand values. You could reorganize this any way you see fit.

df2 = df.groupby('OperonID')[['Start', 'End', 'Strand']].agg(list)
df2['result'] = df2.apply(lambda x: (x['Start'][0], x['End'][-1], set(x['Strand'])), axis=1)

df2['result']:

OperonID
1132034      (2052, 4997, {+})
1132035      (5123, 9818, {+})
1132036    (11421, 11692, {-})
1132037    (14089, 14877, {+})
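
As one possible reorganization, the same numbers can be pulled out as separate columns with named aggregation instead of a tuple column. This is a sketch rather than part of the approach above: it assumes the rows are already in file order (so "first" and "last" mean what they should) and that each operon has a single strand, which "first" does not check.

```python
import pandas as pd

# A few rows from the question, already reduced to the four columns.
df = pd.DataFrame({
    "Start": [2052, 3280, 4434, 5123, 7302],
    "End": [3260, 4437, 4997, 7267, 9818],
    "Strand": ["+", "+", "+", "+", "+"],
    "OperonID": [1132034, 1132034, 1132034, 1132035, 1132035],
})

# One row per OperonID: first Start, last End, first Strand.
summary = df.groupby("OperonID").agg(
    Start=("Start", "first"),
    End=("End", "last"),
    Strand=("Strand", "first"),
)
print(summary)
```

From there, summary.to_csv(...) would write one line per operon if a file is what you need.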

Maybe try something like this logic? It just has a temp variable keeping track of the last OperonID you've seen, and switches the start/end once that changes:

In [21]: lines = open("test.csv").read().splitlines()

In [22]: lines
Out[22]:
['OperonID,GI,Synonym,Start,End,Strand,Length',
 '1132034,397671780,RVBD_0002,2052,3260,+,402',
 '1132034,397671781,RVBD_0003,3280,4437,+,385',
 '1132034,397671782,RVBD_0004,4434,4997,+,187',
 '1132035,397671783,RVBD_0005,5123,7267,+,714',
 '1132035,397671784,RVBD_0006,7302,9818,+,838',
 '1132036,397671786,RVBD_0007Ac,11421,11528,-,35',
 '1132036,397671787,RVBD_0007Bc,11555,11692,-,45',
 '1132037,397671792,RVBD_0012,14089,14877,+,262']

In [23]: cur_operonid = ''

In [24]: cur_start = cur_end = None

In [25]: for line in lines[1:]:
    ...:     cols = line.split(',')  # or line.split('\t') for tab-delimited input
    ...:     if cur_operonid != cols[0]:  # New OperonID reached
    ...:         if cur_start is not None:
    ...:             print(f"{cur_operonid} went from {cur_start} to {cur_end}")
    ...:         cur_operonid = cols[0]
    ...:         cur_start = cols[3]
    ...:     cur_end = cols[4]  # always track the End of the latest row
    ...: if cur_start is not None:  # report the final OperonID as well
    ...:     print(f"{cur_operonid} went from {cur_start} to {cur_end}")
    ...:
1132034 went from 2052 to 4997
1132035 went from 5123 to 9818
1132036 went from 11421 to 11692
1132037 went from 14089 to 14877

operon_id = '1132034'
start = ''
end = ''
strand = ''

all_data = list()

with open("operonmap.opr", "r") as f:
    next(f, None)  # skip the header line
    for raw_line in f:
        if not raw_line.strip():
            continue  # ignore blank lines
        # Product can contain spaces, so split off only the first 8 fields
        # and keep the remainder of the line as the Product.
        (OperonID, GI, Synonym, Start, End,
         Strand, Length, COG_number, Product) = raw_line.strip().split(None, 8)
        data = dict()
        data["OperonID"] = OperonID
        data["GI"] = GI
        data["Synonym"] = Synonym
        data["Start"] = Start
        data["End"] = End
        data["Strand"] = Strand
        data["Length"] = Length
        data["COG_number"] = COG_number
        data["Product"] = Product
        all_data.append(data)

for data in all_data:
    if data["OperonID"] == operon_id:
        start, end, strand = data["Start"], data["End"], data["Strand"]
        print("Start\t", start)
        print("End\t", end)
        print("Strand\t", strand)

OUTPUT

Start    2052
End      3260
Strand   +
Start    3280
End      4437
Strand   +
Start    4434
End      4997
Strand   +
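
This prints every row of the operon rather than a single range. If one start/end/strand per operon is what's wanted, the matching rows could be collapsed as in the sketch below. The helper name operon_span is made up for illustration, and it assumes all_data holds dictionaries shaped like the ones built above, in file order.

```python
def operon_span(all_data, operon_id):
    """Return (start, end, strand) for one operon, or None if it is absent."""
    rows = [d for d in all_data if d["OperonID"] == operon_id]
    if not rows:
        return None
    # Start comes from the first matching row, End from the last one.
    return rows[0]["Start"], rows[-1]["End"], rows[0]["Strand"]


# Stand-in for the list built while reading the file (values kept as strings).
all_data = [
    {"OperonID": "1132034", "Start": "2052", "End": "3260", "Strand": "+"},
    {"OperonID": "1132034", "Start": "3280", "End": "4437", "Strand": "+"},
    {"OperonID": "1132034", "Start": "4434", "End": "4997", "Strand": "+"},
    {"OperonID": "1132035", "Start": "5123", "End": "7267", "Strand": "+"},
]

print(operon_span(all_data, "1132034"))
```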
