I have hundreds of log files like the following, which I'd like to load into two pandas DataFrames per dataset (or a DataFrame and a dictionary, or some other combination).
What is the most efficient way to parse such a file in Python?
Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           34175771            24133928            706.17
5'UTR_Exons         6341914             1366084             215.41
3'UTR_Exons         24930397            8269466             331.70
Introns             929421174           8172570             8.79
TSS_up_1kb          19267668            1044739             54.22
TSS_up_5kb          87647060            1433110             16.35
TSS_up_10kb         159281339           1549571             9.73
TES_down_1kb        19416426            300476              15.48
TES_down_5kb        83322244            718139              8.62
TES_down_10kb       147880768           1014589             6.86
=====================================================================
Obviously, the top three lines hold parameter name/value pairs, while the bottom section has group / total bases / tag count / tags per kb. All of these will always exist, and always be numeric, in all of my datasets, so robust NA handling is not necessary.
At the moment I'm parsing the file into a nested list (one per dataset, i.e. per file), stripping the whitespace, and pulling out the values by index from the list. The challenge is that if the tool generating the file is upgraded and the output format changes slightly (for example, a new tag is added), I'll have a very frustrating time debugging.
You can try read_fwf and read_csv:
import pandas as pd
import io
temp=u"""Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           34175771            24133928            706.17
5'UTR_Exons         6341914             1366084             215.41
3'UTR_Exons         24930397            8269466             331.70
Introns             929421174           8172570             8.79
TSS_up_1kb          19267668            1044739             54.22
TSS_up_5kb          87647060            1433110             16.35
TSS_up_10kb         159281339           1549571             9.73
TES_down_1kb        19416426            300476              15.48
TES_down_5kb        83322244            718139              8.62
TES_down_10kb       147880768           1014589             6.86
====================================================================="""
# after testing, replace io.StringIO(temp) with the filename
df1 = pd.read_fwf(io.StringIO(temp),
                  widths=[30, 8],   # widths of the two fixed columns
                  nrows=3,          # read only the first 3 rows
                  index_col=[0],    # set first column as index
                  names=[None, 0])  # no index name, single column named 0
print (df1)
                            0
Total Reads          38948036
Total Tags           49242267
Total Assigned Tags  44506208
print (df1.T)
   Total Reads  Total Tags  Total Assigned Tags
0     38948036    49242267             44506208
# after testing, replace io.StringIO(temp) with the filename
df2 = pd.read_csv(io.StringIO(temp),
                  sep=r"\s+",   # separator is any run of whitespace
                  skiprows=4,   # skip the first 4 rows
                  comment='=')  # skip rows whose first char is =
print (df2)
           Group  Total_bases  Tag_count  Tags/Kb
0      CDS_Exons     34175771   24133928   706.17
1    5'UTR_Exons      6341914    1366084   215.41
2    3'UTR_Exons     24930397    8269466   331.70
3        Introns    929421174    8172570     8.79
4     TSS_up_1kb     19267668    1044739    54.22
5     TSS_up_5kb     87647060    1433110    16.35
6    TSS_up_10kb    159281339    1549571     9.73
7   TES_down_1kb     19416426     300476    15.48
8   TES_down_5kb     83322244     718139     8.62
9  TES_down_10kb    147880768    1014589     6.86
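The hard-coded nrows=3 / skiprows=4 will break silently if the tool ever adds a summary line, which is the asker's stated worry. One way around that is a sketch that derives the counts from the file itself by locating the first '=' separator line (it assumes, as above, that name and value are separated by at least two spaces; split_counts is a hypothetical helper name):

```python
import io
import pandas as pd

def split_counts(path):
    """Parse one report by locating the first '=' separator line,
    so the number of summary lines is taken from the file itself."""
    with open(path) as fh:
        text = fh.read()
    lines = text.splitlines()
    # count the summary lines before the first '=' separator
    n_summary = next(i for i, line in enumerate(lines)
                     if line.startswith("="))
    summary = pd.read_csv(io.StringIO(text),
                          nrows=n_summary,          # however many there are
                          sep=r"\s\s+", engine="python",
                          index_col=0, header=None, names=[None, 0])
    groups = pd.read_csv(io.StringIO(text),
                         sep=r"\s+",
                         skiprows=n_summary + 1,    # summary block + '=' line
                         comment='=')
    return summary, groups
```

If a fourth summary line appears in a future version of the tool, it simply becomes a fourth row of the summary DataFrame instead of corrupting the group table.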
If the column widths are not always [30, 8], use:
# after testing, replace io.StringIO(temp) with the filename
df1 = pd.read_csv(io.StringIO(temp),
                  nrows=3,          # read only the first 3 rows
                  sep=r"\s\s+",     # separator is 2 or more whitespace chars
                  engine="python",  # regex separator needs the python engine
                  index_col=0,      # set first column as index
                  header=None,      # no header row
                  names=[None, 0])  # no index name, single column named 0
print (df1)
                            0
Total Reads          38948036
Total Tags           49242267
Total Assigned Tags  44506208
print (df1.T)
   Total Reads  Total Tags  Total Assigned Tags
0     38948036    49242267             44506208
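Since you have hundreds of these files, both reads can be wrapped in one helper and applied per file. The glob pattern and the dict-of-DataFrames layout below are just one possible arrangement (adjust reports/*.txt to your naming scheme; parse_report is a hypothetical helper name):

```python
import glob
import pandas as pd

def parse_report(path):
    """Return (summary, groups) DataFrames for a single report file."""
    summary = pd.read_csv(path,
                          nrows=3,            # the three summary lines
                          sep=r"\s\s+", engine="python",
                          index_col=0, header=None, names=[None, 0])
    groups = pd.read_csv(path,
                         sep=r"\s+",
                         skiprows=4,          # summary block + '=' line
                         comment='=')
    return summary, groups

summaries, groups = {}, {}
for path in glob.glob("reports/*.txt"):      # hypothetical file layout
    summaries[path], groups[path] = parse_report(path)

if summaries:
    # one row per file: transpose each 3x1 summary and stack them
    all_summaries = pd.concat({p: s.T for p, s in summaries.items()})
```

Concatenating the transposed summaries gives a single wide DataFrame keyed by filename, which is usually more convenient than hundreds of tiny ones.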