
What's the best way to read the following log file (rseqc output) into python (3)?

I have 100s of the following log files, which I'd like to load into two pandas DataFrame(s) per dataset (or a DF and a dictionary, or some other combination).

What is the most efficient way to parse this file into python?

Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           34175771            24133928            706.17
5'UTR_Exons         6341914             1366084             215.41
3'UTR_Exons         24930397            8269466             331.70
Introns             929421174           8172570             8.79
TSS_up_1kb          19267668            1044739             54.22
TSS_up_5kb          87647060            1433110             16.35
TSS_up_10kb         159281339           1549571             9.73
TES_down_1kb        19416426            300476              15.48
TES_down_5kb        83322244            718139              8.62
TES_down_10kb       147880768           1014589             6.86
=====================================================================

Obviously, the top three lines have parameter name/value pairs, while the bottom section has group/total bases/tag count/tags per kb. All of these will always exist, and be numeric, in all of my datasets, so robust NA handling is not necessary.

At the moment, I'm parsing the file into a nested list (one per dataset, i.e. per file), stripping the whitespace, and pulling out the values by index from the list. The challenge is that if the tool generating the file gets upgraded or the output format changes slightly, for example by adding a new tag, I'll have a very frustrating time debugging.
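One defensive alternative to positional indexing (a sketch of my own, not taken from the answer) is to key every value by its label with regular expressions, so that an added or reordered line shows up as a new dictionary key instead of silently shifting all subsequent indices. The function name `parse_rseqc_log` is hypothetical:

```python
import re

def parse_rseqc_log(text):
    """Parse an rseqc read_distribution log into (totals dict, groups dict),
    keyed by name rather than by line position (hypothetical helper)."""
    totals, groups = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("=") or line.startswith("Group"):
            continue
        # group rows: one token, two integers, one float
        m = re.match(r"(\S+)\s+(\d+)\s+(\d+)\s+([\d.]+)$", line)
        if m:
            groups[m.group(1)] = {"Total_bases": int(m.group(2)),
                                  "Tag_count": int(m.group(3)),
                                  "Tags/Kb": float(m.group(4))}
            continue
        # summary rows: free-text label, two or more spaces, one integer
        m = re.match(r"(.+?)\s{2,}(\d+)$", line)
        if m:
            totals[m.group(1)] = int(m.group(2))
    return totals, groups
```

A new tag would then simply appear as an extra key rather than corrupting existing lookups.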

You can try read_fwf and read_csv:

import pandas as pd
import io

temp=u"""Total Reads                   38948036
Total Tags                    49242267
Total Assigned Tags           44506208
=====================================================================
Group               Total_bases         Tag_count           Tags/Kb
CDS_Exons           34175771            24133928            706.17
5'UTR_Exons         6341914             1366084             215.41
3'UTR_Exons         24930397            8269466             331.70
Introns             929421174           8172570             8.79
TSS_up_1kb          19267668            1044739             54.22
TSS_up_5kb          87647060            1433110             16.35
TSS_up_10kb         159281339           1549571             9.73
TES_down_1kb        19416426            300476              15.48
TES_down_5kb        83322244            718139              8.62
TES_down_10kb       147880768           1014589             6.86
====================================================================="""
#after testing, replace io.StringIO(temp) with the filename
df1 = pd.read_fwf(io.StringIO(temp), 
                 widths=[30,8], #widths of columns                  
                 nrows=3, #read only first 3 rows
                 index_col=[0], #set first column to index
                 names=[None, 0]) #set column names to None and 0

print (df1)
                            0
Total Reads          38948036
Total Tags           49242267
Total Assigned Tags  44506208

print (df1.T)
   Total Reads  Total Tags  Total Assigned Tags
0     38948036    49242267             44506208

#after testing, replace io.StringIO(temp) with the filename
df2 = pd.read_csv(io.StringIO(temp), 
                 sep=r"\s+", #separator is arbitrary whitespace
                 skiprows=4, #skip first 4 rows
                 comment='=') #skip all rows whose first char is =

print (df2)
           Group  Total_bases  Tag_count  Tags/Kb
0      CDS_Exons     34175771   24133928   706.17
1    5'UTR_Exons      6341914    1366084   215.41
2    3'UTR_Exons     24930397    8269466   331.70
3        Introns    929421174    8172570     8.79
4     TSS_up_1kb     19267668    1044739    54.22
5     TSS_up_5kb     87647060    1433110    16.35
6    TSS_up_10kb    159281339    1549571     9.73
7   TES_down_1kb     19416426     300476    15.48
8   TES_down_5kb     83322244     718139     8.62
9  TES_down_10kb    147880768    1014589     6.86
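For real files, the two reads above can be wrapped into a single helper so each log is opened once and both DataFrames come back together. The helper name `read_rseqc_log` and the `metric`/`value` column names are my own, not part of the original answer:

```python
import io
import pandas as pd

def read_rseqc_log(path_or_buf):
    """Hypothetical wrapper: read one rseqc log and return
    (summary DataFrame, group DataFrame) using the calls shown above."""
    if hasattr(path_or_buf, "read"):
        text = path_or_buf.read()
    else:
        with open(path_or_buf) as fh:
            text = fh.read()
    summary = pd.read_csv(io.StringIO(text), nrows=3, sep=r"\s\s+",
                          engine="python", header=None, index_col=0,
                          names=["metric", "value"])
    groups = pd.read_csv(io.StringIO(text), sep=r"\s+",
                         skiprows=4, comment="=")
    return summary, groups
```

Note that fully commented lines (the `=` rulers) are still counted by skiprows, which is why skiprows=4 lands exactly on the Group header.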

If the width of the first columns is not always [30,8], use:

#after testing, replace io.StringIO(temp) with the filename
df1 = pd.read_csv(io.StringIO(temp), 
                 nrows=3, #read only first 3 rows
                 sep=r"\s\s+", #separator is 2 or more whitespace characters
                 engine="python", #avoid ParserWarning for regex separator
                 index_col=0, #set first column to index
                 header=None, #no header
                 names=[None, 0]) #set column names to None (no index name) and 0

print (df1)
                            0
Total Reads          38948036
Total Tags           49242267
Total Assigned Tags  44506208

print (df1.T)
   Total Reads  Total Tags  Total Assigned Tags
0     38948036    49242267             44506208
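Since the question mentions hundreds of such logs, the transposed summaries can also be stacked into one DataFrame with a row per file. A minimal sketch, where `combine_summaries` and the glob pattern are hypothetical names of my own:

```python
import glob
import pandas as pd

def combine_summaries(paths):
    """Stack the three-line summary of each log into one DataFrame,
    one row per file, tagged with the source path (hypothetical helper)."""
    frames = []
    for path in paths:
        one = pd.read_csv(path, nrows=3, sep=r"\s\s+", engine="python",
                          header=None, index_col=0)
        frames.append(one.T.assign(sample=path))
    return pd.concat(frames, ignore_index=True)

# usage, assuming the logs match a hypothetical pattern:
# summary = combine_summaries(glob.glob("logs/*.read_distribution.txt"))
```

The `sample` column then identifies which file each row came from.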
