
Python: print particular lines from a file

The background:

                    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.928 0.00484        0.918        0.937
    1   2664     304    0.822 0.00714        0.808        0.836
    2   2360     104    0.786 0.00766        0.771        0.801
    3   2256      48    0.769 0.00787        0.754        0.784
    4   2208      40    0.755 0.00803        0.739        0.771
    5   2256      48    0.769 0.00787        0.754        0.784
    6   2208      40    0.755 0.00803        0.739        0.771

                Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.938 0.00484        0.918        0.937
    1   2664     304    0.822 0.00714        0.808        0.836
    2   2360     104    0.786 0.00766        0.771        0.801
    3   2256      48    0.769 0.00787        0.754        0.784
    4   1000      40    0.744 0.00803        0.739        0.774
#There is a new line ("\n") here too, it just doesn't come out in the code.

What I want seems simple. I want to turn the above file into output that looks like this:

Gene1  0.755
Gene2  0.744

That is, each gene name, followed by the last number in the survival column of its section.

I have tried several approaches: using regular expressions, and reading the file in as a list and calling ".next()". One example of code that I have tried:

fileopen = open(sys.argv[1]).readlines()  # Read in the file as a list.
for index,line in enumerate(fileopen):   # Enumerate items in list
    if "Table" in line:  # Find the items with "Table" (This will have my gene name)
            line2 = line.split("=")[1]  # Parse line to get my gene name
            if "\n" in fileopen[index+1]: # This is the problem section.
                print fileopen[index]
            else:
                fileopen[index+1]

As you can see in the problem section, this is what I was trying to say in this attempt:

If the next item in the list is a new line, print the item; otherwise, the next line becomes the current line (and then I can split the line to pull out the particular number I want).

If anyone could correct the code so I can see what I did wrong, I'd appreciate it.
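
For reference, here is a sketch of how the attempt above could be corrected while keeping its index-based approach (written for Python 3). Two things go wrong in the original: readlines() keeps the trailing "\n" on every line, so the test on fileopen[index+1] succeeds on the header line and the Table line itself gets printed, and the bare expression fileopen[index+1] in the else branch does nothing. The function name last_survival_per_gene and the shortened SAMPLE data are made up for illustration, and the code assumes every table has a header and at least one data row.

```python
SAMPLE = """\
                    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.928 0.00484        0.918        0.937
    4   2208      40    0.755 0.00803        0.739        0.771

                Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.938 0.00484        0.918        0.937
    4   1000      40    0.744 0.00803        0.739        0.774
"""


def last_survival_per_gene(lines):
    """Return [(gene, survival)] for the last data row of each table."""
    results = []
    for index, line in enumerate(lines):
        if "Table" in line:
            gene = line.split("=")[1].strip()
            # Start at the header and walk forward until the next line
            # is blank or the file ends. Compare the stripped line with ""
            # rather than testing for "\n", which every line contains.
            last = index + 1
            while last + 1 < len(lines) and lines[last + 1].strip() != "":
                last += 1
            results.append((gene, lines[last].split()[3]))
    return results


# keepends=True mimics what readlines() returns.
for gene, survival in last_survival_per_gene(SAMPLE.splitlines(keepends=True)):
    print(gene, survival)
# Gene1 0.755
# Gene2 0.744
```

With a real file, pass open(sys.argv[1]).readlines() instead of the sample.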

This is a bit of overkill, but instead of manually writing a parser for each data item, you can use an existing package such as pandas to read the file in as CSV. You only need to write a bit of code to specify the relevant lines in the file. Un-optimized code (it reads the file twice):

import pandas as pd
def genetable(gene):
    l = open('gene.txt').readlines()
    l += "\n"  # add newline to end of file in case last line is not newline
    lines = len(l)
    skiprows = -1
    for (i, line) in enumerate(l):
        if "Table$Gene=Gene"+str(gene) in line:
            skiprows = i+1
        if skiprows>=0 and line=="\n":
            skipfooter = lines - i - 1
            df = pd.read_csv('gene.txt', sep='\t', engine='python', skiprows=skiprows, skipfooter=skipfooter)
            #  assuming tab separated data given your inputs. change as needed
            # assert df.columns.....
            return df
    return "Not Found"

This will read all the relevant data for that gene into a DataFrame.

You can then do:

genetable(2).survival  # series with all survival rates
genetable(2).survival.iloc[-1]  # last item in survival

The advantage of this is that you have access to all the items, and any malformed part of the file is more likely to be caught, which prevents incorrect values from being used. In my own code I would add assertions on the column names before returning the pandas DataFrame. You want to catch parsing errors early so that they do not propagate.
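
The early-validation idea can also be sketched without pandas, using only the standard library (Python 3). This is not the pandas code above: it checks each table's header against the column names shown in the question and raises as soon as one does not match. The names EXPECTED_HEADER, SURVIVAL_COL, parse_tables and SAMPLE are hypothetical, chosen for this example.

```python
SAMPLE = """\
                    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.928 0.00484        0.918        0.937
    4   2208      40    0.755 0.00803        0.739        0.771

                Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.938 0.00484        0.918        0.937
    4   1000      40    0.744 0.00803        0.739        0.774
"""

# Header taken from the question's file, split on whitespace.
EXPECTED_HEADER = "time n.risk n.event survival std.err lower 95% CI upper 95% CI".split()
SURVIVAL_COL = EXPECTED_HEADER.index("survival")


def parse_tables(lines):
    """Return {gene: [survival, ...]}; raise ValueError on an unexpected header."""
    tables = {}
    gene = None
    expect_header = False
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            gene = None  # a blank line ends the current table
        elif "Table$Gene=" in line:
            gene = line.split("=", 1)[1].strip()
            tables[gene] = []
            expect_header = True
        elif expect_header:
            # Fail early instead of silently reading the wrong column.
            if fields != EXPECTED_HEADER:
                raise ValueError(f"line {lineno}: unexpected header {line.strip()!r}")
            expect_header = False
        elif gene is not None:
            tables[gene].append(float(fields[SURVIVAL_COL]))
    return tables


for gene, survival in parse_tables(SAMPLE.splitlines()).items():
    print(gene, survival[-1])
# Gene1 0.755
# Gene2 0.744
```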

This worked when I tried it:

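import sys
with open(sys.argv[1]) as f:  # file name from the command line, as in the question
    filelines = f.readlines()
gene = 1
for i in range(len(filelines)):
    if filelines[i].strip() == "":  # a blank line ends a table
        print("Gene" + str(gene) + " " + filelines[i-1].split()[3])
        gene += 1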

You could try something like this (I copied your data into foo.dat):

In [1]: with open('foo.dat') as input:
   ...:     lines = input.readlines()
   ...:     

Using with makes sure the file is closed after reading.

In [3]: lines = [ln.strip() for ln in lines]

This gets rid of extra whitespace.

In [5]: startgenes = [n for n, ln in enumerate(lines) if ln.startswith("Table")]

In [6]: startgenes
Out[6]: [0, 10]

In [7]: emptylines = [n for n, ln in enumerate(lines) if len(ln) == 0]

In [8]: emptylines
Out[8]: [9, 17]

Using emptylines relies on the fact that the records are separated by lines containing only whitespace.

In [9]: lastlines = [n-1 for n, ln in enumerate(lines) if len(ln) == 0]

In [10]: for first, last in zip(startgenes, lastlines):
   ....:     gene = lines[first].split("=")[1]
   ....:     num = lines[last].split()[3]  # survival is the fourth column
   ....:     print gene, num
   ....:     
Gene1 0.755
Gene2 0.744
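
The same record-splitting idea can be packaged as one function. The following is only a sketch for Python 3: it uses itertools.groupby to cut the stripped lines into blank and non-blank runs, then takes the fourth field (the survival column) of the last line of each record. The names last_survival_by_record and SAMPLE are made up for the example, and like the session above it relies on records being separated by blank lines.

```python
from itertools import groupby

SAMPLE = """\
                    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.928 0.00484        0.918        0.937
    4   2208      40    0.755 0.00803        0.739        0.771

                Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.938 0.00484        0.918        0.937
    4   1000      40    0.744 0.00803        0.739        0.774
"""


def last_survival_by_record(lines):
    """Return [(gene, survival)] from blank-line-separated records."""
    results = []
    stripped = (ln.strip() for ln in lines)
    # groupby yields alternating runs of blank and non-blank lines.
    for is_blank, group in groupby(stripped, key=lambda ln: ln == ""):
        if is_blank:
            continue
        record = list(group)
        # Expect a title line, a header line and at least one data row.
        if len(record) < 3 or not record[0].startswith("Table"):
            continue
        gene = record[0].split("=")[1]
        results.append((gene, record[-1].split()[3]))
    return results


for gene, survival in last_survival_by_record(SAMPLE.splitlines()):
    print(gene, survival)
# Gene1 0.755
# Gene2 0.744
```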

Here is my solution:

>>> with open('t.txt','r') as f:
...     for l in f:
...         if "Table" in l:
...             gene = l.split("=")[1][:-1]
...         elif l not in ['\n', '\r\n']:
...             surv = l.split()[3]
...         else:
...             print gene, surv
...
Gene1 0.755
Gene2 0.744

Instead of checking for a new line, simply print when you reach the next table, and once more when you are done reading the file:

lines = open("testgenes.txt").readlines()
table = ""
finalsurvival = 0.0
for line in lines:
    if "Table" in line:
        if table != "": # print previous survival
            print table, finalsurvival
        table = line.strip().split('=')[1]
    else:
        try:                
            finalsurvival = line.split('\t')[4]
        except IndexError:
            continue
print table, finalsurvival
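
The loop above splits on tabs, so it only finds the survival column if the file really is tab-separated. Here is a sketch of the same print-at-the-boundary idea for Python 3 that splits on any whitespace instead, and skips header lines by checking that the first field is a number. The names iter_final_survival and SAMPLE are illustrative and not part of the code above.

```python
SAMPLE = """\
                    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.928 0.00484        0.918        0.937
    4   2208      40    0.755 0.00803        0.739        0.771

                Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.938 0.00484        0.918        0.937
    4   1000      40    0.744 0.00803        0.739        0.774
"""


def iter_final_survival(lines):
    """Yield (gene, last survival value) for each table, in file order."""
    gene, survival = None, None
    for line in lines:
        fields = line.split()
        if "Table" in line:
            if gene is not None:
                yield gene, survival  # the previous table is complete
            gene, survival = line.strip().split("=")[1], None
        elif len(fields) > 3 and fields[0].isdigit():
            survival = fields[3]  # data row: remember the latest value
    if gene is not None:
        yield gene, survival  # the final table ends with the file


for gene, survival in iter_final_survival(SAMPLE.splitlines()):
    print(gene, survival)
# Gene1 0.755
# Gene2 0.744
```

Because the function only iterates over its argument, an open file object can be passed directly instead of a list of lines.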
