python print particular lines from file
The background:
Table$Gene=Gene1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.928 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 2208 40 0.755 0.00803 0.739 0.771
5 2256 48 0.769 0.00787 0.754 0.784
6 2208 40 0.755 0.00803 0.739 0.771
Table$Gene=Gene2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.938 0.00484 0.918 0.937
1 2664 304 0.822 0.00714 0.808 0.836
2 2360 104 0.786 0.00766 0.771 0.801
3 2256 48 0.769 0.00787 0.754 0.784
4 1000 40 0.744 0.00803 0.739 0.774
#There is a new line ("\n") here too, it just doesn't come out in the code.
What I want seems simple. I want to turn the above file into an output that looks like this:
Gene1 0.755
Gene2 0.744
i.e. each gene, and the last number in the survival column from each section.
I have tried multiple ways: using regular expressions, and reading the file in as a list and calling ".next()". One example of code that I have tried:
fileopen = open(sys.argv[1]).readlines()  # Read in the file as a list.
for index, line in enumerate(fileopen):  # Enumerate items in list
    if "Table" in line:  # Find the items with "Table" (This will have my gene name)
        line2 = line.split("=")[1]  # Parse line to get my gene name
        if "\n" in fileopen[index+1]:  # This is the problem section.
            print fileopen[index]
        else:
            fileopen[index+1]
So, as you can see in the problem section, what I was trying to say in this attempt was: if the next item in the list is a newline, print the item; otherwise, the next line becomes the current line (and then I can split the line to pull out the particular number I want).
If anyone could correct the code so I can see what I did wrong, I'd appreciate it.
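Two things go wrong in the attempt above. Because readlines() keeps the line ending, "\n" in fileopen[index+1] is true for every line, not just blank ones, and the bare expression fileopen[index+1] in the else branch does nothing. A minimal Python 3 sketch of the intended logic follows. The function name last_survival_per_gene and the shortened sample list are made up for illustration, and it assumes the survival value is the fourth whitespace-separated field.

```python
def last_survival_per_gene(lines):
    """Return [(gene, survival)] with the last survival value of each section."""
    results = []
    for index, line in enumerate(lines):
        if "Table" not in line:
            continue
        gene = line.strip().split("=")[1]
        last_row = None
        # index + 1 is the column-header row, so data rows start at index + 2.
        for row in lines[index + 2:]:
            if not row.strip() or "Table" in row:
                break  # a blank line or the next section ends this table
            last_row = row
        if last_row is not None:
            results.append((gene, last_row.split()[3]))  # 4th field = survival
    return results


# Shortened excerpt of the data from the question (assumption: same layout).
sample = [
    "Table$Gene=Gene1\n",
    "time n.risk n.event survival std.err lower 95% CI upper 95% CI\n",
    "0 2872 208 0.928 0.00484 0.918 0.937\n",
    "6 2208 40 0.755 0.00803 0.739 0.771\n",
    "\n",
    "Table$Gene=Gene2\n",
    "time n.risk n.event survival std.err lower 95% CI upper 95% CI\n",
    "0 2872 208 0.938 0.00484 0.918 0.937\n",
    "4 1000 40 0.744 0.00803 0.739 0.774\n",
]

for gene, survival in last_survival_per_gene(sample):
    print(gene, survival)
```

With a real file, the list from open(sys.argv[1]).readlines() could be passed in place of sample.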
Bit of overkill, but instead of manually writing a parser for each data item, use an existing package like pandas to read in the CSV file. You just need to write a bit of code to specify the relevant lines in the file. Un-optimized code (reading the file twice):
import pandas as pd
def genetable(gene):
    with open('gene.txt') as f:
        l = f.readlines()
    l.append("\n")  # add newline to end of file in case last line is not newline
    lines = len(l)
    skiprows = -1
    for (i, line) in enumerate(l):
        if "Table$Gene=Gene" + str(gene) in line:
            skiprows = i + 1
        if skiprows >= 0 and line == "\n":
            skipfooter = lines - i - 1
            # assuming tab separated data given your inputs. change as needed
            df = pd.read_csv('gene.txt', sep='\t', engine='python', skiprows=skiprows, skipfooter=skipfooter)
            # assert df.columns.....
            return df
    return "Not Found"
This reads all the relevant data for that gene into a DataFrame.
You can then do:
genetable(2).survival           # series with all survival rates
genetable(2).survival.iloc[-1]  # last item in survival
The advantage of this is that you have access to all the items, and any malformed input in the file is more likely to be caught, preventing incorrect values from being used. In my own code I would add assertions on the column names before returning the pandas DataFrame. You want to pick up any parsing errors early so that they do not propagate.
This worked when I tried it:
# filelines is assumed to be the list returned by readlines()
gene = 1
for i in range(len(filelines)):
    if filelines[i].strip() == "":
        print("Gene" + str(gene) + " " + filelines[i-1].split()[3])
        gene += 1
You could try something like this (I copied your data into foo.dat):
In [1]: with open('foo.dat') as input:
   ...:     lines = input.readlines()
   ...:
Using with makes sure the file is closed after reading.
In [3]: lines = [ln.strip() for ln in lines]
This gets rid of extra whitespace.
In [5]: startgenes = [n for n, ln in enumerate(lines) if ln.startswith("Table")]
In [6]: startgenes
Out[6]: [0, 10]
In [7]: emptylines = [n for n, ln in enumerate(lines) if len(ln) == 0]
In [8]: emptylines
Out[8]: [9, 17]
Using emptylines relies on the fact that the records are separated by lines containing only whitespace.
In [9]: lastlines = [n-1 for n, ln in enumerate(lines) if len(ln) == 0]
In [10]: for first, last in zip(startgenes, lastlines):
   ....:     gene = lines[first].split("=")[1]
   ....:     num = lines[last].split()[3]  # index 3 is the survival column; [-1] would give upper 95% CI
   ....:     print gene, num
   ....:
Gene1 0.755
Gene2 0.744
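The same blank-line-separated-records idea can also be sketched in Python 3 with itertools.groupby, which avoids tracking indices by hand. The helper name split_records and the shortened sample text are made up for illustration. Like the session above, it assumes each record starts with its Table line and that records are separated by blank lines.

```python
from itertools import groupby


def split_records(lines):
    """Yield one list of stripped, non-blank lines per blank-line-separated record."""
    stripped = (ln.strip() for ln in lines)
    for is_blank, group in groupby(stripped, key=lambda ln: ln == ""):
        if not is_blank:
            yield list(group)


# Shortened excerpt of the data from the question (assumption: same layout).
sample = """Table$Gene=Gene1
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.928 0.00484 0.918 0.937
6 2208 40 0.755 0.00803 0.739 0.771

Table$Gene=Gene2
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0 2872 208 0.938 0.00484 0.918 0.937
4 1000 40 0.744 0.00803 0.739 0.774
""".splitlines()

records = list(split_records(sample))
for record in records:
    gene = record[0].split("=")[1]    # first line of the record: Table$Gene=...
    survival = record[-1].split()[3]  # last line, 4th field = survival
    print(gene, survival)
```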
Here is my solution:
>>> with open('t.txt','r') as f:
...     for l in f:
...         if "Table" in l:
...             gene = l.split("=")[1][:-1]
...         elif l not in ['\n', '\r\n']:
...             surv = l.split()[3]
...         else:
...             print gene, surv
...
Gene1 0.755
Gene2 0.744
Instead of checking for a newline, simply print when you reach the next table or are done reading the file:
lines = open("testgenes.txt").readlines()
table = ""
finalsurvival = 0.0
for line in lines:
    if "Table" in line:
        if table != "":  # print previous survival
            print table, finalsurvival
        table = line.strip().split('=')[1]
    else:
        try:
            # survival is the 4th whitespace-separated field of the rows shown in the question
            finalsurvival = line.split()[3]
        except IndexError:
            continue
print table, finalsurvival
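A variation on the same idea, sketched in Python 3: keep a dict keyed by gene and overwrite its value on every data row, so whatever is left at the end is the last survival value of each section and no end-of-section test is needed. The function name final_survival_by_gene and the shortened sample list are made up for illustration. It assumes the time column holds non-negative integers, which is how the header row gets skipped.

```python
def final_survival_by_gene(lines):
    """Map each gene name to the survival value of its last data row."""
    survival = {}
    gene = None
    for line in lines:
        if line.startswith("Table"):
            gene = line.strip().split("=")[1]
            continue
        fields = line.split()
        # Data rows start with an integer time; header and blank lines do not.
        if gene is not None and len(fields) >= 4 and fields[0].isdigit():
            survival[gene] = fields[3]  # later rows overwrite earlier ones
    return survival


# Shortened excerpt of the data from the question (assumption: same layout).
sample = [
    "Table$Gene=Gene1\n",
    "time n.risk n.event survival std.err lower 95% CI upper 95% CI\n",
    "0 2872 208 0.928 0.00484 0.918 0.937\n",
    "6 2208 40 0.755 0.00803 0.739 0.771\n",
    "\n",
    "Table$Gene=Gene2\n",
    "time n.risk n.event survival std.err lower 95% CI upper 95% CI\n",
    "0 2872 208 0.938 0.00484 0.918 0.937\n",
    "4 1000 40 0.744 0.00803 0.739 0.774\n",
]

for gene, value in final_survival_by_gene(sample).items():
    print(gene, value)
```

Because this does not depend on blank lines at all, it also works when the file lacks the trailing newline mentioned in the question.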