Python library for large tab/comma-delimited text files

Question

I have some big genomic data files to analyze, which come in two forms. One is a per-individual dosage file like this:

id                      snp1    snp2    snp3    snp4    snp5    snp6
RS1->1000001    DOSE    1.994   1.998   1.998   1.998   1.830   1.335
RS1->1000002    DOSE    1.291   1.998   1.998   1.998   1.830   1.335
RS1->100001     DOSE    1.992   1.998   1.998   1.998   1.830   1.335
RS1->100002     DOSE    1.394   1.998   1.998   1.998   1.830   1.335
RS1->10001      DOSE    1.994   1.998   1.998   1.998   1.830   1.335
RS1->1001001    DOSE    1.904   1.998   1.998   1.998   1.830   1.335
RS1->1002001    DOSE    1.094   1.998   1.998   1.998   1.830   1.335
RS1->1003001    DOSE    1.994   1.998   1.998   1.998   1.830   1.335
RS1->1004001    DOSE    1.994   1.998   1.998   1.998   1.830   1.335
RS1->1005002    DOSE    1.994   1.998   1.998   1.998   1.830   1.335

The other contains some summary info:

SNP         Al1 Al2 Freq1   MAF     Quality Rsq 
22_16050607 G   A   0.99699 0.00301 0.99699 0.00000
22_16050650 C   T   0.99900 0.00100 0.99900 0.00000
22_16051065 G   A   0.99900 0.00100 0.99900 0.00000
22_16051134 A   G   0.99900 0.00100 0.99900 0.00000
rs62224609  T   C   0.91483 0.08517 0.91483 -0.00000
rs62224610  G   C   0.66733 0.33267 0.66733 0.00000
22_16051477 C   A   0.99399 0.00601 0.99399 -0.00000
22_16051493 G   A   0.99900 0.00100 0.99900 -0.00000
22_16051497 A   G   0.64529 0.35471 0.64529 0.00000

The SNP column in the second file corresponds to snp1, snp2, ... in the first file. I need to use the summary info in the second file to do some quality checks and selection, then apply some statistical analysis to the data in the first file accordingly.

The question is, is there a Python library suitable for this task? Performance is vital here, because these are really huge files. Thanks!

Answer 1

For dealing with large files and data with high performance and efficient manipulation, it is hard to beat pandas.

The following code will read your file into a DataFrame and allow easy manipulation:

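import pandas as pd

data = 'my_data.csv'
# read_csv assumes commas by default; the sample files are tab-delimited
df = pd.read_csv(data, sep='\t')

Now df is an efficient DataFrame containing your data! Note that read_csv assumes a comma delimiter by default, so for a tab-delimited file you need to pass sep='\t'. pandas only "sniffs" for the delimiter if you pass sep=None, and that uses the slower Python parsing engine.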
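To connect this to the question's two files, here is a rough sketch of using the summary file to pick SNPs and then loading only those columns from the dosage file. The thresholds (min_maf, min_rsq), the helper name select_snps, the "type" header for the DOSE column, and the idea that the SNP names match the dosage column names exactly are all assumptions for illustration, not something stated in the question. The data is a tiny in-memory stand-in, not real values.

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for the two files (made-up values).
summary_text = (
    "SNP\tAl1\tAl2\tFreq1\tMAF\tQuality\tRsq\n"
    "snp1\tG\tA\t0.99699\t0.00301\t0.99699\t0.00000\n"
    "snp2\tT\tC\t0.91483\t0.08517\t0.91483\t0.45000\n"
    "snp3\tG\tC\t0.66733\t0.33267\t0.66733\t0.80000\n"
)
dosage_text = (
    "id\ttype\tsnp1\tsnp2\tsnp3\n"
    "RS1->1000001\tDOSE\t1.994\t1.998\t1.830\n"
    "RS1->1000002\tDOSE\t1.291\t1.998\t1.335\n"
)


def select_snps(summary, min_maf=0.01, min_rsq=0.3):
    """Return names of SNPs passing hypothetical MAF and Rsq thresholds."""
    keep = (summary["MAF"] >= min_maf) & (summary["Rsq"] >= min_rsq)
    return summary.loc[keep, "SNP"].tolist()


summary = pd.read_csv(io.StringIO(summary_text), sep="\t")
good = select_snps(summary)

# usecols limits parsing to the id column plus the SNPs that passed the check
dosage = pd.read_csv(io.StringIO(dosage_text), sep="\t", usecols=["id"] + good)

# Example statistic: mean dosage per selected SNP
col_means = dosage[good].mean()
```

With real files you would pass the file paths instead of io.StringIO objects. If the dosage file does not fit in memory, read_csv also accepts a chunksize argument, which returns an iterator of smaller DataFrames you can process one at a time.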

Answer 2

There is the csv module. It's written with a C backend, so it should perform pretty well. That said, str.split might be even faster if the format is simple enough.
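As a rough sketch of the csv approach, the following streams a tab-delimited file row by row, so memory use stays small regardless of file size. The function name column_means, the "type" header for the DOSE column, and the sample values are assumptions made for the example.

```python
import csv
import io


def column_means(lines, wanted):
    """Stream tab-delimited rows and return the mean of each wanted column."""
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    idx = [header.index(name) for name in wanted]  # column positions to read
    totals = [0.0] * len(idx)
    n_rows = 0
    for row in reader:
        for k, i in enumerate(idx):
            totals[k] += float(row[i])
        n_rows += 1
    if n_rows == 0:
        return {}
    return {name: totals[k] / n_rows for k, name in enumerate(wanted)}


# In-memory stand-in for a dosage file (made-up values)
sample = io.StringIO(
    "id\ttype\tsnp1\tsnp2\n"
    "RS1->1000001\tDOSE\t1.0\t2.0\n"
    "RS1->1000002\tDOSE\t2.0\t1.0\n"
)
means = column_means(sample, ["snp1", "snp2"])
```

For a real file, pass an open file object (opened with newline="") instead of the StringIO. If the file never contains quoted fields, replacing the reader with line.rstrip("\n").split("\t") on each line is the str.split variant mentioned above.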

It seems to me that rather than using a CSV file to store the data, some sort of database is probably an even better bet.
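One way the database idea could look, using the sqlite3 module from the standard library. This is only a sketch: the table layout, the column names, and the thresholds in the query are assumptions, and the rows are made-up values standing in for the summary file.

```python
import sqlite3

# ":memory:" keeps the example self-contained; use a file path to persist data
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE summary (snp TEXT PRIMARY KEY, maf REAL, rsq REAL)")

# Made-up rows standing in for the summary file
rows = [("snp1", 0.00301, 0.0), ("snp2", 0.08517, 0.45), ("snp3", 0.33267, 0.80)]
conn.executemany("INSERT INTO summary VALUES (?, ?, ?)", rows)
conn.commit()

# Select SNPs passing hypothetical quality thresholds
query = "SELECT snp FROM summary WHERE maf >= ? AND rsq >= ? ORDER BY snp"
passing = [r[0] for r in conn.execute(query, (0.01, 0.3))]
conn.close()
```

Once the data is loaded, repeated selections become simple queries instead of re-parsing the text file each time.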
