Python: printing specific lines from a file

Question

The background:

                    Table$Gene=Gene1
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.928 0.00484        0.918        0.937
    1   2664     304    0.822 0.00714        0.808        0.836
    2   2360     104    0.786 0.00766        0.771        0.801
    3   2256      48    0.769 0.00787        0.754        0.784
    4   2208      40    0.755 0.00803        0.739        0.771
    5   2256      48    0.769 0.00787        0.754        0.784
    6   2208      40    0.755 0.00803        0.739        0.771

                Table$Gene=Gene2
 time n.risk n.event survival std.err lower 95% CI upper 95% CI
    0   2872     208    0.938 0.00484        0.918        0.937
    1   2664     304    0.822 0.00714        0.808        0.836
    2   2360     104    0.786 0.00766        0.771        0.801
    3   2256      48    0.769 0.00787        0.754        0.784
    4   1000      40    0.744 0.00803        0.739        0.774
#There is a new line ("\n") here too, it just doesn't come out in the code.

What I want seems simple. I want to turn the above file into an output that looks like this:

Gene1  0.755
Gene2  0.744

i.e. each gene, and the last number in the survival column from each section.

I have tried multiple ways: using regular expressions, and reading the file in as a list and calling ".next()". One example of code that I have tried:

fileopen = open(sys.argv[1]).readlines()  # Read in the file as a list.
for index,line in enumerate(fileopen):   # Enumerate items in list
    if "Table" in line:  # Find the items with "Table" (This will have my gene name)
            line2 = line.split("=")[1]  # Parse line to get my gene name
            if "\n" in fileopen[index+1]: # This is the problem section.
                print fileopen[index]
            else:
                fileopen[index+1]

So as you can see in the problem section, I was trying to say in this attempt:

If the next item in the list is a new line, print the item; else, the next line becomes the current line (and then I can split the line to pull out the particular number I want).

If anyone could correct the code so I can see what I did wrong, I'd appreciate it.

Answer 1

A bit of overkill, but instead of manually writing a parser for each data item, use an existing package like pandas to read in the CSV file. You just need to write a bit of code to specify the relevant lines in the file. Unoptimized code (it reads the file twice):

import pandas as pd
def genetable(gene):
    l = open('gene.txt').readlines()
    l += "\n"  # add newline to end of file in case last line is not newline
    lines = len(l)
    skiprows = -1
    for (i, line) in enumerate(l):
        if "Table$Gene=Gene"+str(gene) in line:
            skiprows = i+1
        if skiprows>=0 and line=="\n":
            skipfooter = lines - i - 1
            df = pd.read_csv('gene.txt', sep='\t', engine='python', skiprows=skiprows, skipfooter=skipfooter)
            #  assuming tab separated data given your inputs. change as needed
            # assert df.columns.....
            return df
    return "Not Found"

This will read the requested table into a DataFrame with all the relevant data for that gene.

You can then do:

genetable(2).survival  # series with all survival rates
genetable(2).survival.iloc[-1]  # last item in survival

The advantage of this is that you have access to all the items, and any malformatting of the file is more likely to be picked up, which prevents incorrect values from being used. If it were my own code, I would add assertions on the column names before returning the pandas DataFrame. You want to pick up any parsing errors early so that they do not propagate.
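The same "fail early on a bad header" idea can be sketched without pandas. This is a minimal illustration, not the code from this answer: the function name last_survival_checked is hypothetical, and it assumes each table is a "Table$Gene=..." line, then a whitespace-separated header containing "survival", then data rows.

```python
def last_survival_checked(lines):
    """Return [(gene, last survival value)], validating each header first."""
    results = []
    gene = column = last_value = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank separator line
        if "Table" in fields[0]:
            # A new table starts: store the result of the previous one.
            if gene is not None and last_value is not None:
                results.append((gene, last_value))
            gene = fields[0].split("=")[1]
            column = last_value = None
        elif column is None:
            # The first line after a "Table" line must be the header.
            if "survival" not in fields:
                raise ValueError("no 'survival' column in header: %r" % line)
            # Position lookup is safe here only because the multi-word
            # header names ("lower 95% CI") come after "survival".
            column = fields.index("survival")
        else:
            last_value = fields[column]
    if gene is not None and last_value is not None:
        results.append((gene, last_value))
    return results
```

Looking up the column by name instead of hard-coding index 3 means a file with a missing or renamed column raises an error rather than silently returning the wrong number.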

Answer 2

This worked when I tried it:

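import sys

# Assumed: the file path is passed on the command line, as in the question.
filelines = open(sys.argv[1]).readlines()
gene = 1
for i in range(len(filelines)):
    # A blank line ends a table; the line before it is that table's last row.
    if filelines[i].strip() == "":
        print("Gene" + str(gene) + " " + filelines[i-1].split()[3])
        gene += 1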
Answer 3

You could try something like this (I copied your data into foo.dat):

In [1]: with open('foo.dat') as input:
   ...:     lines = input.readlines()
   ...:

Using "with" makes sure the file is closed after reading.

In [3]: lines = [ln.strip() for ln in lines]

This gets rid of extra whitespace.

In [5]: startgenes = [n for n, ln in enumerate(lines) if ln.startswith("Table")]

In [6]: startgenes
Out[6]: [0, 10]

In [7]: emptylines = [n for n, ln in enumerate(lines) if len(ln) == 0]

In [8]: emptylines
Out[8]: [9, 17]

Using emptylines relies on the fact that the records are separated by lines containing only whitespace.

In [9]: lastlines = [n-1 for n, ln in enumerate(lines) if len(ln) == 0]

In [10]: for first, last in zip(startgenes, lastlines):
   ....:     gene = lines[first].split("=")[1]
   ....:     num = lines[last].split()[3]  # index 3 is the survival column
   ....:     print gene, num
   ....:     
Gene1 0.755
Gene2 0.744
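The blank-line separation that this approach relies on can also be used to split the file into blocks directly. The following is a rough sketch using itertools.groupby; the function name last_survival_by_block is hypothetical, and it assumes every non-blank block starts with a "Table$Gene=..." line and ends with a data row.

```python
from itertools import groupby


def last_survival_by_block(lines):
    """Return [(gene, survival)] taken from the last row of each block."""
    results = []
    # groupby yields runs of consecutive lines that share the same key:
    # True for lines with content, False for blank separator lines.
    for has_content, block in groupby(lines, key=lambda ln: bool(ln.strip())):
        if not has_content:
            continue
        block = list(block)
        gene = block[0].strip().split("=")[1]  # first line: Table$Gene=<name>
        survival = block[-1].split()[3]        # last line, survival column
        results.append((gene, survival))
    return results
```

Because the blocks are delimited by the grouping itself, this does not need a trailing blank line after the last table.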

Answer 4

Here is my solution:

>>> with open('t.txt','r') as f:
...     for l in f:
...         if "Table" in l:
...             gene = l.split("=")[1][:-1]
...         elif l not in ['\n', '\r\n']:
...             surv = l.split()[3]
...         else:
...             print gene, surv
...
Gene1 0.755
Gene2 0.744

Answer 5

Instead of checking for a new line, simply print when you reach the next table or finish reading the file:

lines = open("testgenes.txt").readlines()
table = ""
finalsurvival = 0.0
for line in lines:
    if "Table" in line:
        if table != "": # print previous survival
            print table, finalsurvival
        table = line.strip().split('=')[1]
    else:
        try:                
            finalsurvival = line.split('\t')[4]
        except IndexError:
            continue
print table, finalsurvival
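
As for why the newline check in the question's attempt did not work: readlines() keeps the trailing "\n" on each line, so the test "\n" in fileopen[index+1] is true for ordinary data rows as well as blank ones, and the bare expression fileopen[index+1] in the else branch does nothing. A sketch of the index-walking idea with those two points addressed follows. The function name last_survival_by_index is hypothetical, and it assumes every table has a header line and at least one data row.

```python
def last_survival_by_index(lines):
    """From each "Table" line, walk forward to the last row of that table."""
    results = []
    for index, line in enumerate(lines):
        if "Table" not in line:
            continue
        gene = line.strip().split("=")[1]
        row = index + 1  # start at the header line of this table
        # Test for a blank line with strip(): every line from readlines()
        # contains "\n", so '"\n" in line' cannot tell rows from separators.
        while row + 1 < len(lines) and lines[row + 1].strip():
            row += 1  # actually advance, instead of a bare lines[row + 1]
        results.append((gene, lines[row].split()[3]))
    return results
```

It could be called as last_survival_by_index(open(sys.argv[1]).readlines()), which mirrors how the question reads the file.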
