Printing previous line in python using enumerate

I have a file in the below format.

OperonID    GI      Synonym    Start    End  Strand Length  COG_number  Product
1132034 397671780   RVBD_0002   2052    3260    +   402 -   DNA polymerase III subunit beta
1132034 397671781   RVBD_0003   3280    4437    +   385 -   DNA replication and repair protein RecF
1132034 397671782   RVBD_0004   4434    4997    +   187 -   hypothetical protein
1132035 397671783   RVBD_0005   5123    7267    +   714 -   DNA gyrase subunit B
1132035 397671784   RVBD_0006   7302    9818    +   838 -   DNA gyrase subunit A
1132036 397671786   RVBD_0007Ac 11421   11528   -   35  -   hypothetical protein
1132036 397671787   RVBD_0007Bc 11555   11692   -   45  -   hypothetical protein
1132037 397671792   RVBD_0012   14089   14877   +   262 -   hypothetical protein

I need the start and end coordinates of each operon ID, plus the strand, in its own file or string. For example, for operon 1132034 the start coordinate is 2052, the end coordinate is 4997, and the strand is +.

I know I can probably use enumerate and have the following script so far.

lines = open('operonmap.opr', 'r').read().splitlines()
operon_id = 1132034
start = ''
end = ''
strand = ''

for i,line in enumerate(lines):
      if str(operon_id) in line:
            start += line[28:33]
      else:
            end += line[i-1]
            operonline += start
            operonline += end
            operonline += '\n'

If this sort of script worked, I would then edit the string 'operonline' to include only the start, end and strand information. Unfortunately it doesn't work, but I hope you can see my logic.

I hope someone is able to help!

Here is a possible implementation. parse_file contains the following variables:

  • this_info : dictionary containing info relating to the current line

  • previous_info : this_info from previous iteration

  • start_info : this_info from the most recent line that was the start of a new operon ID

The desired output is not exactly clear, but adjust the main program (at the end) to write the extracted fields in any form you choose.

def parse_file(input_file):
    """
    reads an opr file, returns a list of dictionaries with info about the operon ids
    """
    results = []
    start_info = previous_info = {}
    with open(input_file) as f:
        next(f)  # ignore first line
        for line in f:
            bits = line.split()

            # dictionary containing information extracted from a
            # particular line
            this_info = {'operon_id': int(bits[0]),
                         'start': int(bits[3]),
                         'end': int(bits[4]),
                         'strand': bits[5]}

            if not previous_info:
                # first line of file
                start_info = this_info

            elif previous_info['operon_id'] != this_info['operon_id']:
                # this is the first line with NEW Operon ID,
                # so add result for previous Operon ID,  
                # of which the end line was the PREVIOUS line
                _add_result(results, start_info, previous_info)
                start_info = this_info  # start line for this ID

            # also adding a sanity check here - the strand
            # should be the same for every line of a given
            # operon ID
            if start_info["strand"] != this_info["strand"]:
                print("warning, strand info inconsistent")

            previous_info = this_info  # ready for next iteration

        if previous_info:  # guard against a file with no data lines
            _add_result(results, start_info, previous_info)  # last ID

    return results


def _add_result(results, start_info, end_info):
    """
    add to the results a dictionary based on start line info
    but with end line info used for the 'end' field
    """
    info = start_info.copy()
    info['end'] = end_info['end']
    results.append(info)


for result in parse_file('operonmap.opr'):
    # write out some info
    print(result['operon_id'],
          result['start'],
          result['end'],
          result['strand'])

This gives:

1132034 2052 4997 +
1132035 5123 9818 +
1132036 11421 11692 -
1132037 14089 14877 +
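
As a more compact variant of the same idea, the run of consecutive lines for each operon ID can be collected with itertools.groupby from the standard library. This is only a sketch, not part of the answer above: the helper name operon_spans and the inline sample list are made up for illustration, and it assumes whitespace-delimited columns in the order shown in the question, with all lines of one operon adjacent.

```python
from itertools import groupby


def operon_spans(lines):
    """Yield (operon_id, start, end, strand) for each run of consecutive
    lines that share an OperonID (hypothetical helper, for illustration)."""
    rows = (line.split() for line in lines if line.strip())
    for operon_id, group in groupby(rows, key=lambda bits: bits[0]):
        group = list(group)
        first, last = group[0], group[-1]
        # Start comes from the first line of the run, End from the last,
        # and the strand is taken from the first line.
        yield int(operon_id), int(first[3]), int(last[4]), first[5]


# Assumption: data lines only (header already skipped), copied from the question.
sample = [
    "1132034 397671780   RVBD_0002   2052    3260    +   402 -   DNA polymerase III subunit beta",
    "1132034 397671781   RVBD_0003   3280    4437    +   385 -   DNA replication and repair protein RecF",
    "1132034 397671782   RVBD_0004   4434    4997    +   187 -   hypothetical protein",
    "1132035 397671783   RVBD_0005   5123    7267    +   714 -   DNA gyrase subunit B",
    "1132035 397671784   RVBD_0006   7302    9818    +   838 -   DNA gyrase subunit A",
    "1132036 397671786   RVBD_0007Ac 11421   11528   -   35  -   hypothetical protein",
    "1132036 397671787   RVBD_0007Bc 11555   11692   -   45  -   hypothetical protein",
    "1132037 397671792   RVBD_0012   14089   14877   +   262 -   hypothetical protein",
]

for span in operon_spans(sample):
    print(*span)
```

Note that groupby only merges adjacent items, so this relies on the file already being ordered by operon ID, just as the loop above does.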

This is pretty easy with pandas, if you want to go that route.

I read your data into a pandas DataFrame and dropped the other columns:

   Start    End Strand OperonID
0   2052   3260      +  1132034
1   3280   4437      +  1132034
2   4434   4997      +  1132034
3   5123   7267      +  1132035
4   7302   9818      +  1132035
5  11421  11528      -  1132036
6  11555  11692      -  1132036
7  14089  14877      +  1132037

Then I grouped by OperonID, collected the Start, End and Strand values as lists, and added a new column holding the first Start, the last End and the set of unique Strand values for each OperonID. You could reorganize this any way you see fit.

df2 = df.groupby('OperonID')[['Start', 'End', 'Strand']].agg(list)
df2['result'] = df2.apply(lambda x: (x['Start'][0], x['End'][-1], set(x['Strand'])), axis=1)

df2['result']:

OperonID
1132034      (2052, 4997, {+})
1132035      (5123, 9818, {+})
1132036    (11421, 11692, {-})
1132037    (14089, 14877, {+})

Maybe try something like this logic? It just has a temp variable keeping track of the last OperonID you've seen, and switches the start/end once that changes:

In [21]: lines = open("test.csv").read().splitlines()

In [22]: lines
Out[22]:
['OperonID,GI,Synonym,Start,End,Strand,Length',
 '1132034,397671780,RVBD_0002,2052,3260,+,402',
 '1132034,397671781,RVBD_0003,3280,4437,+,385',
 '1132034,397671782,RVBD_0004,4434,4997,+,187',
 '1132035,397671783,RVBD_0005,5123,7267,+,714',
 '1132035,397671784,RVBD_0006,7302,9818,+,838',
 '1132036,397671786,RVBD_0007Ac,11421,11528,-,35',
 '1132036,397671787,RVBD_0007Bc,11555,11692,-,45',
 '1132037,397671792,RVBD_0012,14089,14877,+,262']

In [23]: cur_operonid = ''

In [24]: cur_start = None
    ...: cur_end = None
    ...: for line in lines[1:]:
    ...:     cols = line.split(',')  # or line.split('\t') for tab-delimited input
    ...:     if cur_operonid != cols[0]:  # new OperonID reached
    ...:         if cur_start is not None:
    ...:             print(f"{cur_operonid} went from {cur_start} to {cur_end}")
    ...:         cur_operonid = cols[0]
    ...:         cur_start = cols[3]
    ...:     cur_end = cols[4]  # always track the most recent End
    ...: if cur_start is not None:  # report the final OperonID as well
    ...:     print(f"{cur_operonid} went from {cur_start} to {cur_end}")
    ...:
1132034 went from 2052 to 4997
1132035 went from 5123 to 9818
1132036 went from 11421 to 11692
1132037 went from 14089 to 14877

operon_id = '1132034'

all_data = list()

with open("operonmap.opr", "r") as f:
    # Split on whitespace at most 8 times so that the multi-word
    # Product description stays together in the last field.
    lines = [line.strip().split(None, 8) for line in f if line.strip()]
    body = lines[1:]  # skip the header row
    for line in body:
        OperonID, GI, Synonym, Start, End, Strand, Length, COG_number, Product = line
        data = dict()
        data["OperonID"] = OperonID
        data["GI"] = GI
        data["Synonym"] = Synonym
        data["Start"] = Start
        data["End"] = End
        data["Strand"] = Strand
        data["Length"] = Length
        data["COG_number"] = COG_number
        data["Product"] = Product
        all_data.append(data)

for data in all_data:
    if data["OperonID"] == operon_id:
        start, end, strand = data["Start"], data["End"], data["Strand"]
        print("Start\t", start)
        print("End\t", end)
        print("Strand\t", strand)

OUTPUT

Start    2052
End      3260
Strand   +
Start    3280
End      4437
Strand   +
Start    4434
End      4997
Strand   +
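
The output above lists every row of the operon separately. To collapse rows like these into one start/end pair per operon, one option is to accumulate into a dictionary keyed by OperonID. The sketch below is an addition, not part of the answer: the function name summarize and the sample rows are made up for illustration, and it takes the smallest Start and the largest End, which is an assumption that differs slightly from "first Start, last End" when rows are not in coordinate order.

```python
def summarize(rows):
    """Map each OperonID to (smallest Start, largest End, set of strands).

    `rows` is a list of dicts with "OperonID", "Start", "End" and "Strand"
    keys holding strings (hypothetical helper, for illustration only).
    """
    summary = {}
    for row in rows:
        start, end = int(row["Start"]), int(row["End"])
        if row["OperonID"] not in summary:
            summary[row["OperonID"]] = [start, end, {row["Strand"]}]
        else:
            entry = summary[row["OperonID"]]
            entry[0] = min(entry[0], start)
            entry[1] = max(entry[1], end)
            entry[2].add(row["Strand"])
    return {key: (s, e, strands) for key, (s, e, strands) in summary.items()}


# Assumption: a few rows shaped like the dictionaries built above.
sample_rows = [
    {"OperonID": "1132034", "Start": "2052", "End": "3260", "Strand": "+"},
    {"OperonID": "1132034", "Start": "3280", "End": "4437", "Strand": "+"},
    {"OperonID": "1132034", "Start": "4434", "End": "4997", "Strand": "+"},
    {"OperonID": "1132035", "Start": "5123", "End": "7267", "Strand": "+"},
]

for operon, (start, end, strands) in summarize(sample_rows).items():
    print(operon, start, end, "".join(sorted(strands)))
```

Because the result is keyed by OperonID, this version does not depend on the rows of one operon being adjacent in the file.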
