
Python: print particular lines from a file

The background:

                    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.928 0.00484        0.918        0.937
    1   2664     304    0.822 0.00714        0.808        0.836
    2   2360     104    0.786 0.00766        0.771        0.801
    3   2256      48    0.769 0.00787        0.754        0.784
    4   2208      40    0.755 0.00803        0.739        0.771
    5   2256      48    0.769 0.00787        0.754        0.784
    6   2208      40    0.755 0.00803        0.739        0.771

                Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.938 0.00484        0.918        0.937
    1   2664     304    0.822 0.00714        0.808        0.836
    2   2360     104    0.786 0.00766        0.771        0.801
    3   2256      48    0.769 0.00787        0.754        0.784
    4   1000      40    0.744 0.00803        0.739        0.774
#There is a new line ("\n") here too, it just doesn't come out in the code.

What I want seems simple. I want to turn the above file into an output that looks like this:

Gene1  0.755
Gene2  0.744

i.e. each gene name, followed by the last number in the survival column of its section.

I have tried several approaches, including regular expressions and reading the file in as a list and calling ".next()". One example of the code I have tried:

fileopen = open(sys.argv[1]).readlines()  # Read in the file as a list.
for index,line in enumerate(fileopen):   # Enumerate items in list
    if "Table" in line:  # Find the items with "Table" (This will have my gene name)
            line2 = line.split("=")[1]  # Parse line to get my gene name
            if "\n" in fileopen[index+1]: # This is the problem section.
                print fileopen[index]
            else:
                fileopen[index+1]

As you can see in the problem section, this attempt was meant to say:

if the next item in the list is a blank line, print the current item; otherwise, treat the next line as the current line (so that I can then split it to pull out the number I want).

If anyone could correct the code so I can see what I did wrong I'd appreciate it.
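
A minimal sketch of the logic described above, written for Python 3 and assuming whitespace-separated columns with survival as the fourth field. Note that every line returned by readlines() ends in "\n" (except possibly the last), so the test `"\n" in fileopen[index+1]` is true for ordinary data rows as well; checking whether the stripped line is empty avoids that. The function name last_survival_per_gene and the inline sample are illustrative only.

```python
def last_survival_per_gene(lines):
    """Return (gene, survival) pairs, one per "Table$Gene=..." section."""
    results = []
    for index, line in enumerate(lines):
        if "Table" not in line:
            continue
        gene = line.strip().split("=")[1]
        # Walk forward to the blank line (or end of file) that ends the section.
        end = index + 1
        while end < len(lines) and lines[end].strip():
            end += 1
        # The row just before that point is the last data row of the section.
        last_row = lines[end - 1].split()
        results.append((gene, last_row[3]))  # field 3 is "survival" (assumed layout)
    return results


# Shortened sample in the same layout as the file in the question.
sample = """\
    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.928 0.00484        0.918        0.937
    6   2208      40    0.755 0.00803        0.739        0.771

    Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.938 0.00484        0.918        0.937
    4   1000      40    0.744 0.00803        0.739        0.774
"""

for gene, survival in last_survival_per_gene(sample.splitlines(keepends=True)):
    print(gene, survival)
```

With a real file, the list of lines would come from open(sys.argv[1]).readlines() as in the attempt above.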

This is a bit of overkill, but instead of hand-writing a parser for each data item, you can use an existing package such as pandas to read the file as CSV. You only need a little code to specify which lines of the file are relevant. Unoptimized code (it reads the file twice):

import pandas as pd

def genetable(gene):
    with open('gene.txt') as f:  # close the file once the lines are read
        l = f.readlines()
    l.append("\n")  # sentinel blank line in case the file does not end with one
    lines = len(l)
    skiprows = -1
    for (i, line) in enumerate(l):
        if "Table$Gene=Gene" + str(gene) in line:
            skiprows = i + 1  # the header row follows the "Table$Gene=..." line
        if skiprows >= 0 and line.strip() == "":
            skipfooter = lines - i - 1  # drop the blank line and everything after it
            df = pd.read_csv('gene.txt', sep='\t', engine='python', skiprows=skiprows, skipfooter=skipfooter)
            # assuming tab-separated data given your inputs; change sep as needed
            # assert df.columns.....
            return df
    return "Not Found"

This reads all the relevant data for that gene into a DataFrame.

You can then do:

genetable(2).survival  # Series with all survival rates
genetable(2).survival.iloc[-1]  # last item in survival

The advantage of this is that you have access to all the items, and any malformed content in the file is more likely to be caught, which prevents incorrect values from being used. In my own code I would add assertions on the column names before returning the pandas DataFrame. You want to pick up parsing errors early so that they do not propagate.
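
The same fail-early idea can be sketched without pandas. This is a plain-Python variant, not the pandas code above: it checks the header row and looks up the survival column by name instead of hard-coding its position. The names EXPECTED_PREFIX and survival_column are illustrative, and the header layout is taken from the question.

```python
# Only the first five column names are single tokens; the confidence-interval
# names ("lower 95% CI", "upper 95% CI") contain spaces, so they are not checked.
EXPECTED_PREFIX = ["time", "n.risk", "n.event", "survival", "std.err"]


def survival_column(header_line):
    """Return the index of the 'survival' column, failing loudly on a bad header."""
    names = header_line.split()
    assert names[:5] == EXPECTED_PREFIX, "unexpected header: %r" % header_line
    return names.index("survival")


header = " time n.risk n.event survival std.err lower 95% CI upper 95% CI"
row = "    4   2208      40    0.755 0.00803        0.739        0.771"

col = survival_column(header)
print(row.split()[col])
```

If the file layout ever changes, the assertion stops the script at the header instead of silently reading the wrong column.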

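This worked when I tried it:

import sys

# Assumption: filelines holds the lines of the input file, read as in the question.
filelines = open(sys.argv[1]).readlines()
gene = 1
for i in range(len(filelines)):
    if filelines[i].strip() == "":
        print("Gene" + str(gene) + " " + filelines[i-1].split()[3])
        gene += 1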

You could try something like this (I copied your data into foo.dat):

In [1]: with open('foo.dat') as input:
   ...:     lines = input.readlines()
   ...:     

Using with makes sure the file is closed after reading.

In [3]: lines = [ln.strip() for ln in lines]

This gets rid of extra whitespace.

In [5]: startgenes = [n for n, ln in enumerate(lines) if ln.startswith("Table")]

In [6]: startgenes
Out[6]: [0, 10]

In [7]: emptylines = [n for n, ln in enumerate(lines) if len(ln) == 0]

In [8]: emptylines
Out[8]: [9, 17]

Using emptylines relies on the fact that the records are separated by lines containing only whitespace.

In [9]: lastlines = [n-1 for n, ln in enumerate(lines) if len(ln) == 0]

In [10]: for first, last in zip(startgenes, lastlines):
   ....:     gene = lines[first].split("=")[1]
   ....:     num = lines[last].split()[3]  # field 3 is the survival column
   ....:     print gene, num
   ....:     
Gene1 0.755
Gene2 0.744
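
The blank-line separator that this approach relies on can also be used to group the lines into records directly. A minimal sketch using itertools.groupby from the standard library (Python 3; split_records and the sample lines are illustrative, and survival is assumed to be the fourth whitespace-separated field):

```python
from itertools import groupby


def split_records(lines):
    """Group consecutive non-blank lines into records (lists of stripped lines)."""
    records = []
    # The key is True for blank or whitespace-only lines, False otherwise.
    for is_blank, group in groupby(lines, key=lambda ln: not ln.strip()):
        if not is_blank:
            records.append([ln.strip() for ln in group])
    return records


# Shortened sample in the same layout as foo.dat.
lines = [
    "Table$Gene=Gene1",
    "time n.risk n.event survival std.err lower 95% CI upper 95% CI",
    "0   2872     208    0.928 0.00484        0.918        0.937",
    "6   2208      40    0.755 0.00803        0.739        0.771",
    "",
    "Table$Gene=Gene2",
    "time n.risk n.event survival std.err lower 95% CI upper 95% CI",
    "4   1000      40    0.744 0.00803        0.739        0.774",
    "   ",
]

for record in split_records(lines):
    gene = record[0].split("=")[1]   # first line of a record is the Table line
    num = record[-1].split()[3]      # last line of a record is the last data row
    print(gene, num)
```

Grouping this way avoids keeping separate index lists for the start and end of each record.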

Here is my solution:

>>> with open('t.txt','r') as f:
...     for l in f:
...         if "Table" in l:
...             gene = l.split("=")[1][:-1]
...         elif l not in ['\n', '\r\n']:
...             surv = l.split()[3]
...         else:
...             print gene, surv
...
Gene1 0.755
Gene2 0.744

Instead of checking for a blank line, simply print the previous result whenever a new table starts, and print once more when you are done reading the file:

lines = open("testgenes.txt").readlines()
table = ""
finalsurvival = 0.0
for line in lines:
    if "Table" in line:
        if table != "": # print previous survival
            print table, finalsurvival
        table = line.strip().split('=')[1]
    else:
        try:                
            finalsurvival = line.split('\t')[4]
        except IndexError:
            continue
print table, finalsurvival
