Python library for large tab/comma delimited text files
I have some big genomic data files to analyze, which come in two forms. One is an individual dosage file like this:
id snp1 snp2 snp3 snp4 snp5 snp6
RS1->1000001 DOSE 1.994 1.998 1.998 1.998 1.830 1.335
RS1->1000002 DOSE 1.291 1.998 1.998 1.998 1.830 1.335
RS1->100001 DOSE 1.992 1.998 1.998 1.998 1.830 1.335
RS1->100002 DOSE 1.394 1.998 1.998 1.998 1.830 1.335
RS1->10001 DOSE 1.994 1.998 1.998 1.998 1.830 1.335
RS1->1001001 DOSE 1.904 1.998 1.998 1.998 1.830 1.335
RS1->1002001 DOSE 1.094 1.998 1.998 1.998 1.830 1.335
RS1->1003001 DOSE 1.994 1.998 1.998 1.998 1.830 1.335
RS1->1004001 DOSE 1.994 1.998 1.998 1.998 1.830 1.335
RS1->1005002 DOSE 1.994 1.998 1.998 1.998 1.830 1.335
The other contains some summary info:
SNP Al1 Al2 Freq1 MAF Quality Rsq
22_16050607 G A 0.99699 0.00301 0.99699 0.00000
22_16050650 C T 0.99900 0.00100 0.99900 0.00000
22_16051065 G A 0.99900 0.00100 0.99900 0.00000
22_16051134 A G 0.99900 0.00100 0.99900 0.00000
rs62224609 T C 0.91483 0.08517 0.91483 -0.00000
rs62224610 G C 0.66733 0.33267 0.66733 0.00000
22_16051477 C A 0.99399 0.00601 0.99399 -0.00000
22_16051493 G A 0.99900 0.00100 0.99900 -0.00000
22_16051497 A G 0.64529 0.35471 0.64529 0.00000
The SNP column in the second file corresponds to snp1, snp2, ... in the first file. I need to use the summary info in the second file to do some quality checks and selection, then apply some statistical analysis to the data in the first file accordingly.
The question is, is there a Python library suitable for this task? Performance is vital here, because these are really huge files. Thanks!
For high-performance, efficient manipulation of large data files, pandas is hard to beat.
The following code reads your file into a DataFrame and allows easy manipulation:
import pandas as pd
data = 'my_data.csv'
# read_csv assumes commas by default; pass sep='\t' for a tab-delimited file
df = pd.read_csv(data, sep='\t')
Now df is a DataFrame containing your data. Note that read_csv does not detect the delimiter by default: it assumes commas, so a tab-delimited file needs sep='\t'. Delimiter sniffing only happens when you pass sep=None, and that relies on the slower Python parsing engine.
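For the workflow in the question (filter SNPs using the summary file, then analyze only those columns of the dosage file), a minimal sketch could look like the following. It makes several assumptions that are not in the original post: both files are tab-delimited, the dosage file's header uses the same SNP identifiers as the summary file's SNP column, the second dosage column is named kind, and a MAF cutoff of 0.01 is only a placeholder for whatever quality criteria you actually need. The in-memory strings stand in for the real files.

```python
import io

import pandas as pd

# Toy stand-ins for the two files; with real data, pass file paths instead.
summary_text = (
    "SNP\tAl1\tAl2\tFreq1\tMAF\tQuality\tRsq\n"
    "22_16050607\tG\tA\t0.99699\t0.00301\t0.99699\t0.00000\n"
    "rs62224609\tT\tC\t0.91483\t0.08517\t0.91483\t-0.00000\n"
    "rs62224610\tG\tC\t0.66733\t0.33267\t0.66733\t0.00000\n"
)
# Assumed layout: an id column, a kind column, then one column per SNP id.
dosage_text = (
    "id\tkind\t22_16050607\trs62224609\trs62224610\n"
    "RS1->1000001\tDOSE\t1.994\t1.998\t1.830\n"
    "RS1->1000002\tDOSE\t1.291\t1.998\t1.335\n"
)

# Step 1: quality selection on the (small) summary file.
summary = pd.read_csv(io.StringIO(summary_text), sep="\t")
keep = summary.loc[summary["MAF"] >= 0.01, "SNP"].tolist()  # hypothetical cutoff

# Step 2: read only the selected columns of the (huge) dosage file, in chunks,
# so the whole file never has to fit in memory at once.
reader = pd.read_csv(
    io.StringIO(dosage_text),
    sep="\t",
    usecols=["id"] + keep,
    chunksize=1,  # toy value; something like 100_000 rows is more realistic
)

total = None
n_rows = 0
for chunk in reader:
    chunk_sum = chunk[keep].sum()
    total = chunk_sum if total is None else total + chunk_sum
    n_rows += len(chunk)

mean_dosage = total / n_rows  # per-SNP mean dosage across all individuals
print(mean_dosage)
```

usecols keeps unwanted columns from being parsed at all, and chunksize bounds memory use; whether chunking is needed depends on how the file size compares with available RAM.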
There is the csv module. It's written with a C backend, so it should perform pretty well. That said, str.split might be even faster if the format is simple enough.
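As a rough sketch of both approaches, the snippet below uses csv.DictReader to pick SNPs from the summary file and plain str.split to stream through the dosage file one line at a time. The tab delimiter, the dosage header layout (SNP ids as column names), the helper names passing_snps and column_means, and the 0.01 MAF cutoff are all assumptions for illustration, not something stated in the question.

```python
import csv
import io

# Toy stand-ins for the two files; with real data, use open(path, newline="").
summary_text = (
    "SNP\tAl1\tAl2\tFreq1\tMAF\tQuality\tRsq\n"
    "22_16050607\tG\tA\t0.99699\t0.00301\t0.99699\t0.00000\n"
    "rs62224609\tT\tC\t0.91483\t0.08517\t0.91483\t-0.00000\n"
    "rs62224610\tG\tC\t0.66733\t0.33267\t0.66733\t0.00000\n"
)
dosage_text = (
    "id\tkind\t22_16050607\trs62224609\trs62224610\n"
    "RS1->1000001\tDOSE\t1.994\t1.998\t1.830\n"
    "RS1->1000002\tDOSE\t1.291\t1.998\t1.335\n"
)


def passing_snps(lines, min_maf):
    """Return SNP ids whose MAF is at least min_maf (hypothetical criterion)."""
    reader = csv.DictReader(lines, delimiter="\t")
    return [row["SNP"] for row in reader if float(row["MAF"]) >= min_maf]


def column_means(lines, wanted):
    """Stream a tab-delimited dosage file and average the wanted columns."""
    header = next(lines).rstrip("\n").split("\t")
    positions = [header.index(name) for name in wanted]
    sums = [0.0] * len(positions)
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        for k, pos in enumerate(positions):
            sums[k] += float(fields[pos])
        count += 1
    if count == 0:
        return {}
    return {name: total / count for name, total in zip(wanted, sums)}


wanted = passing_snps(io.StringIO(summary_text), min_maf=0.01)
means = column_means(io.StringIO(dosage_text), wanted)
print(wanted, means)
```

Only one line is held in memory at a time, so this scales to files of any size, at the cost of doing the numeric work in pure Python.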
It seems to me that rather than using a CSV file to store the data, some sort of database is probably an even better bet.
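To illustrate the database idea, here is a minimal sketch using SQLite through the standard-library sqlite3 module. The choice of SQLite, the table name snp_summary, the column names, and the 0.01 MAF cutoff are assumptions made for this example. The rows are taken from the summary sample in the question.

```python
import sqlite3

# A few rows from the summary file: SNP, Al1, Al2, Freq1, MAF, Quality, Rsq.
rows = [
    ("22_16050607", "G", "A", 0.99699, 0.00301, 0.99699, 0.0),
    ("rs62224609", "T", "C", 0.91483, 0.08517, 0.91483, 0.0),
    ("rs62224610", "G", "C", 0.66733, 0.33267, 0.66733, 0.0),
]

# An in-memory database for the sketch; use a file path to persist the data.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE snp_summary (
        snp     TEXT PRIMARY KEY,
        al1     TEXT,
        al2     TEXT,
        freq1   REAL,
        maf     REAL,
        quality REAL,
        rsq     REAL
    )
    """
)
conn.executemany("INSERT INTO snp_summary VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# Quality selection becomes a query instead of a pass over a text file.
cursor = conn.execute(
    "SELECT snp FROM snp_summary WHERE maf >= ? ORDER BY snp", (0.01,)
)
selected = [row[0] for row in cursor]
conn.close()

print(selected)
```

Loading the data costs time once, but repeated selections with different thresholds then avoid re-parsing the text file. The wide dosage matrix is a harder fit for a relational table, so this sketch covers only the summary file.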