
Retaining nuisance columns in pandas groupby

I have a moderately sized data set that I am processing with pandas. It has around 600,000 rows.

It has three "id" variables: "gene_id", "gene_name" and "transcript_id", and then a number of numerical columns which are determined at run-time.

In [129]: df.head().to_dict()
{u'utr3_count': {8: 2.0, 30: 1.0, 29: 2.0, 6: 2.0, 7: 2.0}, 
 u'gene_id': {8: u'ENSG00000188157', 30: u'ENSG00000160087', 29: u'ENSG00000176022', 6: u'ENSG00000188157', 7: u'ENSG00000188157'}, 
 u'utr3_enrichment': {8: 2.1449912126499999, 30: 1.14988290398, 29: 1.0484234234200001, 6: 2.1449912126499999, 7: 2.1449912126499999},
 u'transcript_id': {8: u'ENST00000379370', 30: u'ENST00000450390', 29: u'ENST00000379198', 6: u'ENST00000379370', 7: u'ENST00000379370'},
 u'expression': {8: 0.13387876534027521, 30: 0.514855687606112, 29: 0.79126387397064091, 6: 0.13387876534027521, 7: 0.13387876534027521}, 
 u'gene_name': {8: u'AGRN', 30: u'UBE2J2', 29: u'B3GALT6', 6: u'AGRN', 7: u'AGRN'}}

I want to get the mean of the replicates for each "transcript_id". But grouping on "transcript_id" alone means that I lose the information in "gene_id" and "gene_name", as they are classed as nuisance columns.

If I group on all three columns, I immediately get a MemoryError, even on a big machine (128GB), as pandas tries to do the calculation for every combination of the values in the three columns, even though this is definitely not necessary: each "transcript_id" maps to a single "gene_id" and a single "gene_name".

Is there a way to do the groupby just on transcript_id without losing the information in the other columns?

Simple Solution:

Store transcript_id, gene_id and gene_name in a separate DataFrame (say metadata), keeping one row per transcript_id:

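metadata = df[['transcript_id', 'gene_id', 'gene_name']].drop_duplicates()
# drop_duplicates() returns a new frame with one row per transcript_id,
# so the merge below does not repeat rows for every replicate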

groupby on transcript_id as you do now, and perform your calculations (call the result agg_df, with transcript_id kept as a column, for example via as_index=False or reset_index()). After they are done, merge the two frames together:

pd.merge(agg_df, metadata, how='left', on='transcript_id')
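A minimal end-to-end sketch of these steps on a small made-up frame may make the flow clearer. The values are invented for illustration, and the names id_cols, value_cols and result are not from the question:

```python
import pandas as pd

# Toy data (made-up values): two replicates of one transcript, one row of another.
df = pd.DataFrame({
    "transcript_id": ["ENST00000379370", "ENST00000379370", "ENST00000450390"],
    "gene_id": ["ENSG00000188157", "ENSG00000188157", "ENSG00000160087"],
    "gene_name": ["AGRN", "AGRN", "UBE2J2"],
    "expression": [0.1, 0.3, 0.5],
    "utr3_count": [2.0, 4.0, 1.0],
})

id_cols = ["transcript_id", "gene_id", "gene_name"]

# One metadata row per transcript.
metadata = df[id_cols].drop_duplicates()

# The numeric columns are only known at run-time, so derive them from the frame.
value_cols = [c for c in df.columns if c not in id_cols]

# Mean of the replicates, grouped on transcript_id alone.
agg_df = df.groupby("transcript_id", as_index=False)[value_cols].mean()

# Reattach gene_id and gene_name.
result = pd.merge(agg_df, metadata, how="left", on="transcript_id")
print(result)
```

Selecting value_cols explicitly keeps the string columns out of the mean, so nothing depends on how the installed pandas version treats non-numeric columns in an aggregation.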

This works because

... each "transcript_id" maps to a single "gene_id" and a single "gene_name"
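If you would rather confirm that assumption on your own data than take it on trust, something along these lines could be run first. It is only a sketch on made-up rows; the names per_transcript and is_one_to_one are illustrative:

```python
import pandas as pd

# Toy data (made-up values) standing in for the real frame.
df = pd.DataFrame({
    "transcript_id": ["T1", "T1", "T2"],
    "gene_id": ["G1", "G1", "G2"],
    "gene_name": ["AGRN", "AGRN", "UBE2J2"],
    "expression": [0.1, 0.3, 0.5],
})

# Number of distinct gene_id / gene_name values seen for each transcript_id.
per_transcript = df.groupby("transcript_id")[["gene_id", "gene_name"]].nunique()

# True only if every transcript maps to exactly one gene_id and one gene_name.
is_one_to_one = bool((per_transcript == 1).all().all())
print(is_one_to_one)

# pd.merge can enforce the same thing: validate="many_to_one" raises a
# MergeError if the right-hand frame has repeated transcript_id keys.
metadata = df[["transcript_id", "gene_id", "gene_name"]].drop_duplicates()
agg_df = df.groupby("transcript_id", as_index=False)[["expression"]].mean()
checked = pd.merge(agg_df, metadata, how="left", on="transcript_id",
                   validate="many_to_one")
```

If a transcript did map to two genes, drop_duplicates() would leave two metadata rows for it and the validated merge would fail loudly instead of silently duplicating rows.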


Alternate Solution:

Read the file in chunks (assuming you are reading from csv) using pd.read_csv(file_path, chunksize=<some integer, say 50000>). groupby on all three columns within each chunk (you won't run into MemoryError now because you are only reading part of the data) and keep running totals and running counts. Divide the totals by the counts at the end. In outline:

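import pandas as pd

keys = ['transcript_id', 'gene_id', 'gene_name']
totals = None
counts = None
# each chunk is an ordinary DataFrame of at most `chunksize` rows
for chunk in pd.read_csv(file_path, chunksize=50000):
    grouped = chunk.groupby(keys)
    chunk_sum, chunk_count = grouped.sum(), grouped.count()
    if totals is None:
        totals, counts = chunk_sum, chunk_count
    else:
        # fill_value=0 keeps groups that are missing from one of the two frames
        totals = totals.add(chunk_sum, fill_value=0)
        counts = counts.add(chunk_count, fill_value=0)
means = totals / counts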
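As a quick sanity check of the running-totals idea, the sketch below applies it to a tiny in-memory CSV and compares the result with a plain groupby mean over the whole table. The data is made up, and chunksize=2 is chosen only to force several chunks:

```python
import io
import pandas as pd

# Made-up CSV content; in practice this would be the file on disk.
csv_text = """transcript_id,gene_id,gene_name,expression
T1,G1,A,1.0
T1,G1,A,3.0
T2,G2,B,2.0
T1,G1,A,5.0
T2,G2,B,4.0
"""

keys = ["transcript_id", "gene_id", "gene_name"]
totals = None
counts = None

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    grouped = chunk.groupby(keys)
    chunk_sum, chunk_count = grouped.sum(), grouped.count()
    if totals is None:
        totals, counts = chunk_sum, chunk_count
    else:
        # Groups absent from one side are treated as contributing 0.
        totals = totals.add(chunk_sum, fill_value=0)
        counts = counts.add(chunk_count, fill_value=0)

chunked_means = totals / counts

# Reference: the same means computed in one pass over the full table.
full_means = pd.read_csv(io.StringIO(csv_text)).groupby(keys).mean()
max_diff = (chunked_means.sort_index() - full_means.sort_index()).abs().max().max()
print(max_diff)
```

The transcripts are deliberately split across chunk boundaries here, which is the case the fill_value=0 argument is meant to handle.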

This will work as long as you need some measure that can be computed in bits and pieces, like sums, counts, products, cumulative sums and products. But anything like percentiles or


Another solution (slightly harder): Merge the columns transcript_id, gene_id and gene_name into another column, say merged_id, and groupby on merged_id. Split the column back into its components at the end of your calculations.
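One way this could look, as a sketch on made-up data. The separator "::" is an assumption (it must not occur inside any identifier), and the names SEP, parts and result are illustrative:

```python
import pandas as pd

# Toy data (made-up values).
df = pd.DataFrame({
    "transcript_id": ["T1", "T1", "T2"],
    "gene_id": ["G1", "G1", "G2"],
    "gene_name": ["AGRN", "AGRN", "UBE2J2"],
    "expression": [0.1, 0.3, 0.5],
})

id_cols = ["transcript_id", "gene_id", "gene_name"]
value_cols = [c for c in df.columns if c not in id_cols]
SEP = "::"  # assumed never to appear in the identifiers themselves

# Build a single string key from the three id columns.
df["merged_id"] = df["transcript_id"] + SEP + df["gene_id"] + SEP + df["gene_name"]

# Group on the single key instead of on three columns.
means = df.groupby("merged_id")[value_cols].mean().reset_index()

# Split the key back into its three components.
parts = means["merged_id"].str.split(SEP, expand=True)
parts.columns = id_cols

result = pd.concat([parts, means.drop(columns="merged_id")], axis=1)
print(result)
```

Building and splitting 600,000 string keys costs extra time and memory, which is part of why this route is harder than the merge-based one.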


P.S. I recommend using the Simple Solution.
