What's the best way to read the following log file (rseqc output) into Python 3?
I have 100 of the log files shown below, and I would like to load each dataset into two pandas DataFrames (or a DataFrame plus a dict, or some other combination).
What is the most efficient way to parse this file into Python?
Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           34175771            24133928            706.17
5'UTR_Exons         6341914             1366084             215.41
3'UTR_Exons         24930397            8269466             331.70
Introns             929421174           8172570             8.79
TSS_up_1kb          19267668            1044739             54.22
TSS_up_5kb          87647060            1433110             16.35
TSS_up_10kb         159281339           1549571             9.73
TES_down_1kb        19416426            300476              15.48
TES_down_5kb        83322244            718139              8.62
TES_down_10kb       147880768           1014589             6.86
=====================================================================
Clearly, the first three lines hold parameter name/value pairs, while the lower half holds Group / Total_bases / Tag_count / Tags/Kb columns. All of these values will always be present and numeric in every one of my datasets, so robust NA handling is not needed.
Currently I parse each file into a nested list (one per dataset, i.e. per file), strip the whitespace, and then pull values out of the list by index. The problem is that if the tool producing the file is upgraded and the output format changes slightly, for example by adding a new tag, debugging becomes very frustrating.
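To make the two sections concrete, here is a minimal plain-Python sketch (hypothetical, not my current parser) that keys the summary lines by name rather than by position, so an added line would not shift anything; `log_text` is a truncated copy of the sample above:

```python
import re

# Truncated copy of the sample log above (first three table rows only).
log_text = """Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           34175771            24133928            706.17
5'UTR_Exons         6341914             1366084             215.41
3'UTR_Exons         24930397            8269466             331.70
====================================================================="""

summary = {}   # header section: parameter name -> value
rows = []      # table section: (group, total_bases, tag_count, tags_per_kb)
in_table = False
for line in log_text.splitlines():
    line = line.strip()
    if not line:
        continue
    if set(line) == {"="}:              # a run of '=' separates the two sections
        in_table = True
        continue
    if not in_table:
        # key the summary by name, not position, so new lines don't shift indices
        name, value = re.split(r"\s{2,}", line)
        summary[name] = int(value)
    elif not line.startswith("Group"):  # skip the table's header row
        group, bases, tags, per_kb = line.split()
        rows.append((group, int(bases), int(tags), float(per_kb)))
```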
import pandas as pd
import io
temp=u"""Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           34175771            24133928            706.17
5'UTR_Exons         6341914             1366084             215.41
3'UTR_Exons         24930397            8269466             331.70
Introns             929421174           8172570             8.79
TSS_up_1kb          19267668            1044739             54.22
TSS_up_5kb          87647060            1433110             16.35
TSS_up_10kb         159281339           1549571             9.73
TES_down_1kb        19416426            300476              15.48
TES_down_5kb        83322244            718139              8.62
TES_down_10kb       147880768           1014589             6.86
====================================================================="""
#after testing, replace io.StringIO(temp) with the filename
df1 = pd.read_fwf(io.StringIO(temp),
widths=[30,8], #widths of columns
nrows=3, #read only first 3 rows
index_col=[0], #set first column to index
names=[None, 0]) #set column names to None and 0
print (df1)
                            0
Total Reads          38948036
Total Tags           49242267
Total Assigned Tags  44506208
print (df1.T)
   Total Reads  Total Tags  Total Assigned Tags
0     38948036    49242267             44506208
#after testing, replace io.StringIO(temp) with the filename
df2 = pd.read_csv(io.StringIO(temp),
sep=r"\s+", #separator is arbitrary whitespace
skiprows=4, #skip first 4 rows
comment='=') #skip all rows with first char =
print (df2)
           Group  Total_bases  Tag_count  Tags/Kb
0      CDS_Exons     34175771   24133928   706.17
1    5'UTR_Exons      6341914    1366084   215.41
2    3'UTR_Exons     24930397    8269466   331.70
3        Introns    929421174    8172570     8.79
4     TSS_up_1kb     19267668    1044739    54.22
5     TSS_up_5kb     87647060    1433110    16.35
6    TSS_up_10kb    159281339    1549571     9.73
7   TES_down_1kb     19416426     300476    15.48
8   TES_down_5kb     83322244     718139     8.62
9  TES_down_10kb    147880768    1014589     6.86
If the column widths are not always [30, 8], use instead:
#after testing, replace io.StringIO(temp) with the filename
df1 = pd.read_csv(io.StringIO(temp),
nrows=3, #read only first 3 rows
sep=r"\s\s+", #separator is 2 or more whitespace characters
engine="python", #clean ParserWarning
index_col=0, #set first column to index
header=None, #no header
names=[None, 0]) #set columns names to None (no index name) and 0
print (df1)
                            0
Total Reads          38948036
Total Tags           49242267
Total Assigned Tags  44506208
print (df1.T)
   Total Reads  Total Tags  Total Assigned Tags
0     38948036    49242267             44506208