Most efficient way to split a pandas data frame and perform a function on it

Question

I have been given a data frame in which the rows hold two measurements of a value (A and B) and each column holds the measurements for one sample.

Example below:

ID S1 S2 S3
M1_A 1 2 3 
M1_B 3 2 1
M2_A 1 2 3 
M2_B 3 2 1

I need to calculate the ratio of B to A+B, i.e. B/(A+B), for each measurement in each sample.

Result data frame example:

ID S1 S2 S3
M1 0.75 0.5 0.25
M2 0.75 0.5 0.25

Currently I am reading the file two lines at a time, checking that the IDs match (ignoring the _A or _B suffix), converting the lines to vectors, and then performing the calculation across the vectors. On larger sample sets this gets extremely slow.

What is the most efficient way to do this using a library such as pandas?

All help appreciated!

Answer 1

This sounds like a classic groupby-aggregate problem. Pandas can also handle the underscore in the ID column easily.

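# Strip the _A/_B suffix so both rows of a measurement share one ID.
df['ID'] = df['ID'].str.split('_').str[0]
# Assumes the B row is last within each ID, so this is B / (A + B) per column.
df = df.groupby('ID').agg(lambda x: x.values[-1] / x.sum())
print(df)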

      S1   S2    S3
ID                 
M1  0.75  0.5  0.25
M2  0.75  0.5  0.25
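
If the A and B rows are not guaranteed to arrive in that order, a variant that selects them by label may be safer. The following is only a sketch under assumptions that are not in the question: the data is built inline from the example table, and the names `raw`, `parts`, `indexed`, `kind`, `a`, `b` and `ratio` are made up for illustration. It splits the ID into a measurement name and an A/B label, then divides the two aligned frames in one vectorized step instead of calling a Python lambda per group.

```python
import io

import pandas as pd

# Example data from the question, inlined so the sketch is self-contained.
raw = """ID S1 S2 S3
M1_A 1 2 3
M1_B 3 2 1
M2_A 1 2 3
M2_B 3 2 1
"""
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Split "M1_A" into the measurement name ("M1") and the label ("A" or "B").
parts = df["ID"].str.rsplit("_", n=1, expand=True)

# Index the sample columns by (measurement, label).
indexed = df.drop(columns="ID").set_index(
    [parts[0].rename("ID"), parts[1].rename("kind")]
)

# Select the A rows and the B rows by label; both end up indexed by measurement.
a = indexed.xs("A", level="kind")
b = indexed.xs("B", level="kind")

# Element-wise B / (A + B), aligned on measurement ID and sample column.
ratio = b / (a + b)
print(ratio)
```

Because `a` and `b` are aligned on the ID index before dividing, the row order in the input file does not matter here, only that each measurement has exactly one A row and one B row.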
