Python, Pandas DataFrame - creating a derived column with conditional statements
I have a Pandas DataFrame with 50 columns and 50k rows. There is one column with measurement data that needs correcting with a calibration factor. The factor is a value to be added or subtracted. Measurements from multiple (about 10) sensors share the same column of measurement data ['T_uncalibrated']; each sensor has a unique serial number in a separate column ['serial'].
I can calibrate a single sensor as follows using .where:
data['T_calibrated'] = data['T_uncalibrated'].where(data['serial'] == 12345)-2.7
Here 12345 is the unique serial number and -2.7 is the calibration factor.
How would I write this in a more generic form, so that I can apply the unique calibration factor associated with each serial number and store the results in a single combined column ['T_calibrated']? So far I'm stuck with brute-force approaches. I'm sure there must be some very elegant way to do this.
I have a second dataframe with the serial numbers and calibration factors, which can of course be looped over or compared against.
Shortly after posting my question I saw the light.
I joined the two dataframes on the serial numbers, preserving the index of the original (because I want that). Then I created another column by subtracting the two values. I didn't know how to use "inplace=True" with the join statement.
Here's my code:
calibrated_data=data.join(calibration_dataframe.set_index('serial'),on='serial')
calibrated_data['T_calibrated'] = calibrated_data.T_uncalibrated - calibrated_data.calibration_factor
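The join approach above can be tried on a small made-up example. This is only a sketch: the column names follow the question, but the serial numbers, temperatures, and calibration factors are invented for illustration. It uses a non-default index to show that the left frame's index survives the join.

```python
import pandas as pd

# Hypothetical sample data; all values are invented for illustration only.
data = pd.DataFrame(
    {"serial": [12345, 67890, 12345, 67890],
     "T_uncalibrated": [20.0, 21.5, 19.0, 22.5]},
    index=[10, 11, 12, 13],  # non-default index, to show it is preserved
)
calibration_dataframe = pd.DataFrame(
    {"serial": [12345, 67890], "calibration_factor": [2.5, -1.0]}
)

# join() returns a new frame rather than modifying `data`,
# so the result has to be assigned to a name.
calibrated_data = data.join(calibration_dataframe.set_index("serial"), on="serial")
calibrated_data["T_calibrated"] = (
    calibrated_data["T_uncalibrated"] - calibrated_data["calibration_factor"]
)
print(calibrated_data)
```

DataFrame.join has no inplace parameter; assigning the result back to the same name (data = data.join(...)) gives the same effect.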
You describe two data frames structured as below. The simplest approach is to merge them, then calculate the required column from the merged data frame.
import numpy as np
import pandas as pd

# Six example serial numbers: "a97", "b98", ..., "f102"
serial = [f"{a}{ord(a)}" for a in "abcdef"]
# 50 measurements, each tagged with a randomly chosen serial
df = pd.DataFrame({"serial": np.random.choice(serial, 50), "T_uncalibrated": np.random.randint(20, 30, 50)})
# One calibration factor per serial
dfs = pd.DataFrame({"serial": serial, "calibration": np.random.randint(-2, 2, len(serial))})
df.merge(dfs, on="serial").assign(T_calibrated=lambda d: d["T_uncalibrated"] + d["calibration"])
serial | T_uncalibrated | calibration | T_calibrated |
---|---|---|---|
c99 | 20 | -2 | 18 |
c99 | 27 | -2 | 25 |
c99 | 28 | -2 | 26 |
c99 | 28 | -2 | 26 |
c99 | 20 | -2 | 18 |
c99 | 22 | -2 | 20 |
c99 | 24 | -2 | 22 |
c99 | 24 | -2 | 22 |
d100 | 21 | -1 | 20 |
d100 | 26 | -1 | 25 |
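If the original row order and index need to stay untouched, one alternative is to look up each row's factor with Series.map instead of merging. This is a sketch of a swapped-in technique, not what the answer above does; it reuses the names df and dfs, but with small fixed values (invented here) instead of random ones so the result is predictable.

```python
import pandas as pd

# Hypothetical fixed data in the same shape as the random frames above.
df = pd.DataFrame({"serial": ["a97", "c99", "a97", "b98"],
                   "T_uncalibrated": [20, 27, 22, 25]})
dfs = pd.DataFrame({"serial": ["a97", "b98", "c99"],
                    "calibration": [1, 0, -2]})

# Turn the calibration table into a serial -> calibration lookup Series,
# then map it onto the serial column; the result is aligned with df's index.
factors = df["serial"].map(dfs.set_index("serial")["calibration"])
df["T_calibrated"] = df["T_uncalibrated"] + factors
print(df)
```

Any serial that is missing from the calibration table maps to NaN, so those rows end up with a NaN in T_calibrated rather than being dropped.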