Most efficient way to split and perform function on pandas data frame
I have been given a data frame that contains two measurements of a value (A and B) in rows, and each column represents the measurements for a sample.
Example below:
ID S1 S2 S3
M1_A 1 2 3
M1_B 3 2 1
M2_A 1 2 3
M2_B 3 2 1
I need to calculate the ratio of B to A+B, i.e. B/(A+B), for each measurement in each sample.
Result data frame example:
ID S1 S2 S3
M1 0.75 0.5 0.25
M2 0.75 0.5 0.25
Currently I am reading in the file two lines at a time, checking that the IDs match (excluding the _A or _B suffix), converting the lines to vectors, and then performing the calculation on the vectors. On larger sample sets this gets extremely slow.
What is the most efficient way to do this using a library such as pandas?
All help appreciated!
This sounds like a classic groupby-aggregate problem. Pandas can handle the underscore in the ID column easily as well.
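# Strip the _A/_B suffix so both rows of a measurement share one ID
df['ID'] = df['ID'].str.split('_').str[0]
# Assumes the B row comes last within each ID: last value / (A + B), per sample
df = df.groupby('ID').agg(lambda x: x.values[-1]/x.sum())
print(df)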
S1 S2 S3
ID
M1 0.75 0.5 0.25
M2 0.75 0.5 0.25
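One caveat: x.values[-1] only picks up B if the _B row comes after the _A row within each ID. If the row order is not guaranteed, something along these lines might be safer. This is a rough sketch and not part of the answer above: the sample frame is rebuilt from the question, and the names parts, values, a, b and ratio are just illustrative. It selects the A and B rows explicitly and lets pandas align them by measurement name, which also avoids calling a Python lambda per group.

```python
import pandas as pd

# Sample data rebuilt from the question (assumed layout: one ID column, one column per sample).
df = pd.DataFrame({
    "ID": ["M1_A", "M1_B", "M2_A", "M2_B"],
    "S1": [1, 3, 1, 3],
    "S2": [2, 2, 2, 2],
    "S3": [3, 1, 3, 1],
})

# Split "M1_A" into the measurement name ("M1") and the kind ("A" or "B").
parts = df["ID"].str.rsplit("_", n=1, expand=True)
values = df.drop(columns="ID")
values.index = pd.MultiIndex.from_arrays([parts[0], parts[1]], names=["ID", "kind"])

# Pull out the A rows and the B rows; each result is indexed by measurement name only.
a = values.xs("A", level="kind")
b = values.xs("B", level="kind")

# pandas aligns on the index, so each B row is divided by its matching A + B.
ratio = b / (a + b)
print(ratio)
```

Because the division is aligned on ID, shuffling the input rows should not change the result, and a measurement that is missing its A or B row shows up as NaN instead of silently producing a wrong ratio.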