
What's the best way to read the following log file (rseqc output) into python (3)?

I have 100s of the following log files, which I'd like to load into two pandas DataFrame(s) per dataset (or a DF and a dictionary, or some other combination).

What is the most efficient way to parse this file into python?

Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           34175771            24133928            706.17
5'UTR_Exons         6341914             1366084             215.41
3'UTR_Exons         24930397            8269466             331.70
Introns             929421174           8172570             8.79
TSS_up_1kb          19267668            1044739             54.22
TSS_up_5kb          87647060            1433110             16.35
TSS_up_10kb         159281339           1549571             9.73
TES_down_1kb        19416426            300476              15.48
TES_down_5kb        83322244            718139              8.62
TES_down_10kb       147880768           1014589             6.86
=====================================================================

Obviously, the top three lines have parameter name/value pairs, while the bottom section has group/total bases/tag count/tags per kb. All of these will always exist, and be numeric, in all of my datasets, so robust NA handling is not necessary.

At the moment, I'm parsing the file into a nested list (one per dataset, i.e. per file), stripping the whitespace, and pulling out the values by index from the list. The challenge is that if the tool generating the file gets upgraded or the output format changes slightly, for example by adding a new tag, I'll have a very frustrating time debugging.
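One defensive alternative to positional indexing (a sketch of my own, not taken from the answer) is to key every value by its label with regular expressions, so that an added or reordered line shows up as a new dictionary key instead of silently shifting all subsequent indices. The function name `parse_rseqc_log` is hypothetical:

```python
import re

def parse_rseqc_log(text):
    """Parse an rseqc read_distribution log into (totals dict, groups dict),
    keyed by name rather than by line position (hypothetical helper)."""
    totals, groups = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("=") or line.startswith("Group"):
            continue
        # group rows: one token, two integers, one float
        m = re.match(r"(\S+)\s+(\d+)\s+(\d+)\s+([\d.]+)$", line)
        if m:
            groups[m.group(1)] = {"Total_bases": int(m.group(2)),
                                  "Tag_count": int(m.group(3)),
                                  "Tags/Kb": float(m.group(4))}
            continue
        # summary rows: free-text label, two or more spaces, one integer
        m = re.match(r"(.+?)\s{2,}(\d+)$", line)
        if m:
            totals[m.group(1)] = int(m.group(2))
    return totals, groups
```

A new tag would then simply appear as an extra key rather than corrupting existing lookups.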

You can try read_fwf and read_csv:

import pandas as pd
import io

temp=u"""Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           34175771            24133928            706.17
5'UTR_Exons         6341914             1366084             215.41
3'UTR_Exons         24930397            8269466             331.70
Introns             929421174           8172570             8.79
TSS_up_1kb          19267668            1044739             54.22
TSS_up_5kb          87647060            1433110             16.35
TSS_up_10kb         159281339           1549571             9.73
TES_down_1kb        19416426            300476              15.48
TES_down_5kb        83322244            718139              8.62
TES_down_10kb       147880768           1014589             6.86
====================================================================="""
#after testing, replace io.StringIO(temp) with the filename
df1 = pd.read_fwf(io.StringIO(temp), 
                 widths=[30,8], #widths of columns                  
                 nrows=3, #read only first 3 rows
                 index_col=[0], #set first column to index
                 names=[None, 0]) #set column names to None and 0

print (df1)
                            0
Total Reads          38948036
Total Tags           49242267
Total Assigned Tags  44506208

print (df1.T)
   Total Reads  Total Tags  Total Assigned Tags
0     38948036    49242267             44506208

#after testing, replace io.StringIO(temp) with the filename
df2 = pd.read_csv(io.StringIO(temp), 
                 sep=r"\s+", #separator is arbitrary whitespace
                 skiprows=4, #skip first 4 rows
                 comment='=') #skip all rows whose first char is =

print (df2)
           Group  Total_bases  Tag_count  Tags/Kb
0      CDS_Exons     34175771   24133928   706.17
1    5'UTR_Exons      6341914    1366084   215.41
2    3'UTR_Exons     24930397    8269466   331.70
3        Introns    929421174    8172570     8.79
4     TSS_up_1kb     19267668    1044739    54.22
5     TSS_up_5kb     87647060    1433110    16.35
6    TSS_up_10kb    159281339    1549571     9.73
7   TES_down_1kb     19416426     300476    15.48
8   TES_down_5kb     83322244     718139     8.62
9  TES_down_10kb    147880768    1014589     6.86
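For real files, the two reads above can be wrapped into a single helper so each log is opened once and both DataFrames come back together. The helper name `read_rseqc_log` and the `metric`/`value` column names are my own, not part of the original answer:

```python
import io
import pandas as pd

def read_rseqc_log(path_or_buf):
    """Hypothetical wrapper: read one rseqc log and return
    (summary DataFrame, group DataFrame) using the calls shown above."""
    if hasattr(path_or_buf, "read"):
        text = path_or_buf.read()
    else:
        with open(path_or_buf) as fh:
            text = fh.read()
    summary = pd.read_csv(io.StringIO(text), nrows=3, sep=r"\s\s+",
                          engine="python", header=None, index_col=0,
                          names=["metric", "value"])
    groups = pd.read_csv(io.StringIO(text), sep=r"\s+",
                         skiprows=4, comment="=")
    return summary, groups
```

Note that fully commented lines (the `=` rulers) are still counted by skiprows, which is why skiprows=4 lands exactly on the Group header.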

If the width of the first columns is not always [30,8], use:

#after testing, replace io.StringIO(temp) with the filename
df1 = pd.read_csv(io.StringIO(temp), 
                 nrows=3, #read only first 3 rows
                 sep=r"\s\s+", #separator is 2 or more whitespace characters
                 engine="python", #avoid ParserWarning for regex separator
                 index_col=0, #set first column to index
                 header=None, #no header
                 names=[None, 0]) #set column names to None (no index name) and 0

print (df1)
                            0
Total Reads          38948036
Total Tags           49242267
Total Assigned Tags  44506208

print (df1.T)
   Total Reads  Total Tags  Total Assigned Tags
0     38948036    49242267             44506208
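Since the question mentions hundreds of such logs, the transposed summaries can also be stacked into one DataFrame with a row per file. A minimal sketch, where `combine_summaries` and the glob pattern are hypothetical names of my own:

```python
import glob
import pandas as pd

def combine_summaries(paths):
    """Stack the three-line summary of each log into one DataFrame,
    one row per file, tagged with the source path (hypothetical helper)."""
    frames = []
    for path in paths:
        one = pd.read_csv(path, nrows=3, sep=r"\s\s+", engine="python",
                          header=None, index_col=0)
        frames.append(one.T.assign(sample=path))
    return pd.concat(frames, ignore_index=True)

# usage, assuming the logs match a hypothetical pattern:
# summary = combine_summaries(glob.glob("logs/*.read_distribution.txt"))
```

The `sample` column then identifies which file each row came from.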
