
How can I normalize data in a pandas dataframe to the starting value of a time series?

I would like to analyze a dataset from a clinical study using pandas. Patients come to the clinic for several visits, and some parameters are measured at each visit. I would like to normalize the blood parameters to the values of the first visit (baseline values), i.e. Normalized = Parameter[Visit X] / Parameter[Visit 1]. The dataset looks roughly like the following example:

    import pandas as pd 
    import numpy as np
    rng = np.random.RandomState(0)
    df = pd.DataFrame({'Patient': ['A','A','A','B','B','B','C','C','C'],
                       'Visit': [1,2,3,1,2,3,1,2,3],
                       'Parameter': rng.randint(0, 100, 9)},
                       columns = ['Patient', 'Visit', 'Parameter'])
    df

      Patient  Visit  Parameter
    0       A      1         44
    1       A      2         47
    2       A      3         64
    3       B      1         67
    4       B      2         67
    5       B      3          9
    6       C      1         83
    7       C      2         21
    8       C      3         36

Now I would like to add a column that contains each parameter normalized to the baseline value, i.e. the value at Visit 1. The simplest approach would be to add a column that holds only the Visit 1 value for each patient and then divide the Parameter column by this added column. However, I have not managed to create such a column holding the baseline value for each respective patient. Maybe there is also a one-line solution that does not need the extra column.

The result should look like this:

      Patient  Visit  Parameter  Normalized
    0       A      1         44        1.0
    1       A      2         47        1.07
    2       A      3         64        1.45
    3       B      1         67        1.0
    4       B      2         67        1.0
    5       B      3          9        0.13
    6       C      1         83        1.0
    7       C      2         21        0.25
    8       C      3         36        0.43

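If I understand correctly, you can use GroupBy.transform with 'first'. It returns each patient's first Parameter value repeated on every row of that patient, so the Parameter column can be divided by it directly (this assumes each patient's Visit 1 row comes first, as in the example):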

    df['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
                                             .transform('first'))
    print(df)

      Patient  Visit  Parameter  Normalized
    0       A      1         44    1.000000
    1       A      2         47    1.068182
    2       A      3         64    1.454545
    3       B      1         67    1.000000
    4       B      2         67    1.000000
    5       B      3          9    0.134328
    6       C      1         83    1.000000
    7       C      2         21    0.253012
    8       C      3         36    0.433735
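
To see what the one-liner does, it may help to split it into the two steps the question describes: build the per-patient baseline first, then divide. The following is only an illustrative sketch; the variable names `baseline` and `normalized` are made up here, and the Parameter values are typed in from the example output rather than generated randomly.

```python
import pandas as pd

# Same data as in the question, with the Parameter values written out.
df = pd.DataFrame({'Patient': list('AAABBBCCC'),
                   'Visit': [1, 2, 3] * 3,
                   'Parameter': [44, 47, 64, 67, 67, 9, 83, 21, 36]})

# transform('first') returns a Series with the same index as df,
# holding each patient's first Parameter value on every row.
baseline = df.groupby('Patient')['Parameter'].transform('first')

# Element-wise division; equivalent to df['Parameter'].div(baseline).
normalized = df['Parameter'] / baseline
```

Here `baseline` is exactly the helper column the question asks about; it could be stored with `df['Baseline'] = baseline` if it is wanted in the output.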

    df['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
                                             .transform('first')).round(2)
    print(df)

      Patient  Visit  Parameter  Normalized
    0       A      1         44        1.00
    1       A      2         47        1.07
    2       A      3         64        1.45
    3       B      1         67        1.00
    4       B      2         67        1.00
    5       B      3          9        0.13
    6       C      1         83        1.00
    7       C      2         21        0.25
    8       C      3         36        0.43

If you need to create a new DataFrame:

    df2 = df.assign(Normalized = df['Parameter'].div(df.groupby('Patient')['Parameter'].transform('first')))

A lambda can also be passed to assign.

Or:

    df2 = df.copy()
    df2['Normalized'] = df['Parameter'].div(df.groupby('Patient')['Parameter']
                                             .transform('first'))

As @ansev said, use GroupBy.transform.

If you wish to preserve the Parameter column, assign the result to a new column named Normalized instead of overwriting Parameter. With a lambda, assign evaluates the expression on the DataFrame it is called on:

    df = df.assign(Normalized = lambda x: x['Parameter'].div(x.groupby('Patient')['Parameter'].transform('first')))
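
One caveat with transform('first'): it takes the first row of each group in the order the rows appear, so it only gives the Visit 1 value when each patient's rows are ordered by visit. Below is a minimal sketch of an alternative that looks up the Visit 1 row explicitly. It assumes exactly one Visit 1 row per patient; the shuffled sample data and the name `baseline_by_patient` are made up for illustration.

```python
import pandas as pd

# Hypothetical data where the rows are NOT sorted by Visit.
df = pd.DataFrame({'Patient': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'Visit': [2, 1, 3, 3, 2, 1],
                   'Parameter': [47, 44, 64, 9, 67, 67]})

# Series indexed by patient, holding the Visit 1 value
# (assumes one Visit 1 row per patient, so the index is unique).
baseline_by_patient = df.loc[df['Visit'].eq(1)].set_index('Patient')['Parameter']

# Map each row's patient to that patient's baseline, then divide.
df['Normalized'] = df['Parameter'] / df['Patient'].map(baseline_by_patient)
```

Sorting first, e.g. `df.sort_values(['Patient', 'Visit'])` before the groupby, should be another way to get the same effect, since the division aligns on the index.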
