I have hundreds of log files like the following, which I'd like to load into two pandas DataFrames per dataset (or a DataFrame and a dictionary, or some other combination).
What is the most efficient way to parse such a file in Python?
Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           34175771            24133928            706.17
5'UTR_Exons         6341914             1366084             215.41
3'UTR_Exons         24930397            8269466             331.70
Introns             929421174           8172570             8.79
TSS_up_1kb          19267668            1044739             54.22
TSS_up_5kb          87647060            1433110             16.35
TSS_up_10kb         159281339           1549571             9.73
TES_down_1kb        19416426            300476              15.48
TES_down_5kb        83322244            718139              8.62
TES_down_10kb       147880768           1014589             6.86
=====================================================================
Obviously, the top three lines hold parameter name/value pairs, while the bottom section has group / total bases / tag count / tags per kb. All of these will always exist, and always be numeric, in all of my datasets, so robust NA handling is not necessary.
At the moment I'm parsing the file into a nested list (one per dataset, i.e. per file), stripping the whitespace, and pulling out the values by index from the list. The challenge is that if the tool generating the file is upgraded and the output format changes slightly (for example, a new tag is added), I'll have a very frustrating time debugging.
You can try read_fwf and read_csv:
import pandas as pd
import io
temp=u"""Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           34175771            24133928            706.17
5'UTR_Exons         6341914             1366084             215.41
3'UTR_Exons         24930397            8269466             331.70
Introns             929421174           8172570             8.79
TSS_up_1kb          19267668            1044739             54.22
TSS_up_5kb          87647060            1433110             16.35
TSS_up_10kb         159281339           1549571             9.73
TES_down_1kb        19416426            300476              15.48
TES_down_5kb        83322244            718139              8.62
TES_down_10kb       147880768           1014589             6.86
====================================================================="""
# after testing, replace io.StringIO(temp) with the filename
df1 = pd.read_fwf(io.StringIO(temp),
                  widths=[30, 8],   # widths of the two fixed columns
                  nrows=3,          # read only the first 3 rows
                  index_col=[0],    # set first column as index
                  names=[None, 0])  # no index name, single column named 0
print (df1)
                            0
Total Reads          38948036
Total Tags           49242267
Total Assigned Tags  44506208
print (df1.T)
   Total Reads  Total Tags  Total Assigned Tags
0     38948036    49242267             44506208
# after testing, replace io.StringIO(temp) with the filename
df2 = pd.read_csv(io.StringIO(temp),
                  sep=r"\s+",   # separator is any run of whitespace
                  skiprows=4,   # skip the first 4 rows
                  comment='=')  # skip rows whose first char is =
print (df2)
           Group  Total_bases  Tag_count  Tags/Kb
0      CDS_Exons     34175771   24133928   706.17
1    5'UTR_Exons      6341914    1366084   215.41
2    3'UTR_Exons     24930397    8269466   331.70
3        Introns    929421174    8172570     8.79
4     TSS_up_1kb     19267668    1044739    54.22
5     TSS_up_5kb     87647060    1433110    16.35
6    TSS_up_10kb    159281339    1549571     9.73
7   TES_down_1kb     19416426     300476    15.48
8   TES_down_5kb     83322244     718139     8.62
9  TES_down_10kb    147880768    1014589     6.86
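The hard-coded nrows=3 / skiprows=4 will break silently if the tool ever adds a summary line, which is the asker's stated worry. One way around that is a sketch that derives the counts from the file itself by locating the first '=' separator line (it assumes, as above, that name and value are separated by at least two spaces; split_counts is a hypothetical helper name):

```python
import io
import pandas as pd

def split_counts(path):
    """Parse one report by locating the first '=' separator line,
    so the number of summary lines is taken from the file itself."""
    with open(path) as fh:
        text = fh.read()
    lines = text.splitlines()
    # count the summary lines before the first '=' separator
    n_summary = next(i for i, line in enumerate(lines)
                     if line.startswith("="))
    summary = pd.read_csv(io.StringIO(text),
                          nrows=n_summary,          # however many there are
                          sep=r"\s\s+", engine="python",
                          index_col=0, header=None, names=[None, 0])
    groups = pd.read_csv(io.StringIO(text),
                         sep=r"\s+",
                         skiprows=n_summary + 1,    # summary block + '=' line
                         comment='=')
    return summary, groups
```

If a fourth summary line appears in a future version of the tool, it simply becomes a fourth row of the summary DataFrame instead of corrupting the group table.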
If the column widths are not always [30, 8], use:
# after testing, replace io.StringIO(temp) with the filename
df1 = pd.read_csv(io.StringIO(temp),
                  nrows=3,          # read only the first 3 rows
                  sep=r"\s\s+",     # separator is 2 or more whitespace chars
                  engine="python",  # regex separator needs the python engine
                  index_col=0,      # set first column as index
                  header=None,      # no header row
                  names=[None, 0])  # no index name, single column named 0
print (df1)
                            0
Total Reads          38948036
Total Tags           49242267
Total Assigned Tags  44506208
print (df1.T)
   Total Reads  Total Tags  Total Assigned Tags
0     38948036    49242267             44506208
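Since you have hundreds of these files, both reads can be wrapped in one helper and applied per file. The glob pattern and the dict-of-DataFrames layout below are just one possible arrangement (adjust reports/*.txt to your naming scheme; parse_report is a hypothetical helper name):

```python
import glob
import pandas as pd

def parse_report(path):
    """Return (summary, groups) DataFrames for a single report file."""
    summary = pd.read_csv(path,
                          nrows=3,            # the three summary lines
                          sep=r"\s\s+", engine="python",
                          index_col=0, header=None, names=[None, 0])
    groups = pd.read_csv(path,
                         sep=r"\s+",
                         skiprows=4,          # summary block + '=' line
                         comment='=')
    return summary, groups

summaries, groups = {}, {}
for path in glob.glob("reports/*.txt"):      # hypothetical file layout
    summaries[path], groups[path] = parse_report(path)

if summaries:
    # one row per file: transpose each 3x1 summary and stack them
    all_summaries = pd.concat({p: s.T for p, s in summaries.items()})
```

Concatenating the transposed summaries gives a single wide DataFrame keyed by filename, which is usually more convenient than hundreds of tiny ones.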