How can I normalize data in a pandas DataFrame to the starting value of a time series?
I would like to analyze a dataset from a clinical study using pandas. Patients come to the clinic for several visits, and some parameters are measured at each visit. I would like to normalize the blood parameters to the values of the first visit (baseline values), i.e.: Normalized = Parameter[Visit X] / Parameter[Visit 1]. The dataset looks roughly like the following example:
import pandas as pd
import numpy as np
rng = np.random.RandomState(0)
df = pd.DataFrame({'Patient': ['A','A','A','B','B','B','C','C','C'],
                   'Visit': [1,2,3,1,2,3,1,2,3],
                   'Parameter': rng.randint(0, 100, 9)},
                  columns = ['Patient', 'Visit', 'Parameter'])
df
Patient Visit Parameter
0 A 1 44
1 A 2 47
2 A 3 64
3 B 1 67
4 B 2 67
5 B 3 9
6 C 1 83
7 C 2 21
8 C 3 36
Now I would like to add a column that contains each parameter normalized to the baseline value, i.e. the value at Visit 1. The simplest approach would be to add a column that contains only the Visit 1 value for each patient and then divide the Parameter column by this added column. However, I have not managed to create such a column holding the baseline value for each respective patient. Maybe there are also one-line solutions that do not need an extra column.
The result should look like this:
Patient Visit Parameter Normalized
0 A 1 44 1.0
1 A 2 47 1.07
2 A 3 64 1.45
3 B 1 67 1.0
4 B 2 67 1.0
5 B 3 9 0.13
6 C 1 83 1.0
7 C 2 21 0.25
8 C 3 36 0.43
IIUC, GroupBy.transform
df['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
.transform('first'))
print(df)
Patient Visit Parameter Normalized
0 A 1 44 1.000000
1 A 2 47 1.068182
2 A 3 64 1.454545
3 B 1 67 1.000000
4 B 2 67 1.000000
5 B 3 9 0.134328
6 C 1 83 1.000000
7 C 2 21 0.253012
8 C 3 36 0.433735
To round the result to two decimal places, as in the expected output:
df['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
.transform('first')).round(2)
print(df)
Patient Visit Parameter Normalized
0 A 1 44 1.00
1 A 2 47 1.07
2 A 3 64 1.45
3 B 1 67 1.00
4 B 2 67 1.00
5 B 3 9 0.13
6 C 1 83 1.00
7 C 2 21 0.25
8 C 3 36 0.43
If you need to create a new DataFrame:
df2 = df.assign(Normalized = df['Parameter'].div(df.groupby('Patient')['Parameter'].transform('first')))
We could also use a lambda, as I suggested.
Or:
df2 = df.copy()
df2['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
.transform('first'))
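The question also asked how to build the intermediate baseline column itself. A minimal sketch of that two-step version follows, assuming the rows of each patient are already ordered by Visit so that the first row is Visit 1. The column name Baseline is only an illustrative choice, and the Parameter values are copied from the example table rather than regenerated.

```python
import pandas as pd

# Same data as the example table in the question, written out explicitly.
df = pd.DataFrame({
    'Patient': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Visit': [1, 2, 3, 1, 2, 3, 1, 2, 3],
    'Parameter': [44, 47, 64, 67, 67, 9, 83, 21, 36],
})

# Step 1: transform('first') returns a Series aligned with df, in which every
# row holds the first Parameter value of that row's patient group.
# 'Baseline' is a hypothetical column name chosen for this sketch.
df['Baseline'] = df.groupby('Patient')['Parameter'].transform('first')

# Step 2: divide each measurement by its patient's baseline.
df['Normalized'] = df['Parameter'] / df['Baseline']

print(df)
```

Because transform returns a result with the same index as the original frame, the division lines up row by row without any merge.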
What @ansev said: GroupBy.transform
If you wish to preserve the Parameter column, just run the last line he wrote but with Normalized instead of Parameter as the new column name:
df = df.assign(Normalized = lambda x: x['Parameter'].div(x.groupby('Patient')['Parameter'].transform('first')))
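Note that transform('first') takes whichever row comes first within each patient group, so the approaches above implicitly assume the data is sorted by Visit. If that cannot be guaranteed, one possible variant is to look up the Visit 1 value explicitly. The sketch below assumes every patient has exactly one Visit 1 row; the variable name baseline and the shuffled row order are made up for illustration.

```python
import pandas as pd

# Same measurements as in the question, but with the rows deliberately shuffled.
df = pd.DataFrame({
    'Patient': ['A', 'B', 'A', 'C', 'B', 'C', 'A', 'B', 'C'],
    'Visit': [2, 1, 1, 3, 3, 1, 3, 2, 2],
    'Parameter': [47, 67, 44, 36, 9, 83, 64, 67, 21],
})

# Series of baseline values indexed by patient, built from Visit 1 rows only.
# Assumes exactly one Visit 1 row per patient (the index must be unique).
baseline = df.loc[df['Visit'] == 1].set_index('Patient')['Parameter']

# Map each row's patient to its baseline, then divide.
df['Normalized'] = df['Parameter'] / df['Patient'].map(baseline)

print(df.sort_values(['Patient', 'Visit']))
```

Sorting with df.sort_values(['Patient', 'Visit']) before calling transform('first') should work as well; the explicit lookup simply makes the dependence on Visit 1 visible in the code.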