Conditional generation of new column - Pandas
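I am trying to create a new column based on conditional logic applied to existing columns. I know there are probably more efficient ways to do this, but I have a few conditions that need to be included. This is only the first step.
The overall scope is to create two new columns mapped from columns 1 and 2. These are referenced to the Object column, because I can have multiple rows for each time point.
Object2 and Value determine how the new columns are mapped. So if Value == X, I want to match the two Object columns and return the corresponding 1 and 2 for that time point to the new columns. The same process should happen if Value == Y. If Value == Z, I want to insert 0, 0. Everything else should be NaN.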
import numpy as np
import pandas as pd

df = pd.DataFrame({
'Time' : ['2019-08-02 09:50:10.1','2019-08-02 09:50:10.1','2019-08-02 09:50:10.2','2019-08-02 09:50:10.3','2019-08-02 09:50:10.3','2019-08-02 09:50:10.4','2019-08-02 09:50:10.5','2019-08-02 09:50:10.6','2019-08-02 09:50:10.6'],
'Object' : ['B','A','A','A','C','C','C','B','B'],
'1' : [1,3,5,7,9,11,13,15,17],
'2' : [0,1,4,6,8,10,12,14,16],
'Object2' : ['A','A',np.nan,'C','C','C','C','B','A'],
'Value' : ['X','X',np.nan,'Y','Y','Y','Y','Z',np.nan],
})
def map_12(df):
    for i in df['Value']:
        if i == 'X':
            df['A1'] = df['1']
            df['A2'] = df['2']
        elif i == 'Y':
            df['A1'] = df['1']
            df['A2'] = df['2']
        elif i == 'Z':
            df['A1'] = 0
            df['A2'] = 0
        else:
            df['A1'] = np.nan
            df['A2'] = np.nan
    return df
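A note on why the attempt above does not produce per-row results: every branch assigns to the whole column, so each iteration overwrites what the previous one wrote and only the last entry of Value decides the outcome. The following is a minimal sketch on a three-row toy frame; the data is made up for illustration and is not the frame from the question.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "1": [1, 3, 5],
    "2": [0, 1, 4],
    "Value": ["X", "Z", np.nan],
})

def map_12(df):
    for i in df["Value"]:
        if i == "X" or i == "Y":
            df["A1"] = df["1"]    # assigns the WHOLE column, not one row
            df["A2"] = df["2"]
        elif i == "Z":
            df["A1"] = 0
            df["A2"] = 0
        else:
            df["A1"] = np.nan
            df["A2"] = np.nan
    return df

out = map_12(toy.copy())
print(out)   # A1 and A2 are NaN in every row, because the last Value is NaN
```

Any row-by-row rule therefore needs either boolean masks or a row-wise function rather than column assignment inside a loop.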
Expected output:
Time Object 1 2 Object2 Value A1 A2
0 2019-08-02 09:50:10.1 A 1 0 A X 1.0 0.0 # Match A-A at this time point, so output is 1,0
1 2019-08-02 09:50:10.1 B 3 1 A X 1.0 0.0 # Still at same time point so use 1,0
2 2019-08-02 09:50:10.2 A 5 4 NaN NaN NaN NaN # No Value so NaN
3 2019-08-02 09:50:10.3 C 7 6 C Y 7.0 6.0 # Match C-C at this time point, so output is 7,6
4 2019-08-02 09:50:10.3 A 9 8 C Y 7.0 6.0 # Still at same time point so use 7,6
5 2019-08-02 09:50:10.4 C 11 10 C Y 11.0 10.0 # Match C-C at this time point, so output is 11,10
6 2019-08-02 09:50:10.5 C 13 12 C Y 13.0 12.0 # Match C-C at this time point, so output is 13,12
7 2019-08-02 09:50:10.6 B 15 14 B Z 0.0 0.0 # Z so 0,0
8 2019-08-02 09:50:10.6 B 17 16 A NaN NaN NaN # No Value so NaN
New sample df:
df = pd.DataFrame({
'Time' : ['2019-08-02 09:50:10.1','2019-08-02 09:50:10.1','2019-08-02 09:50:10.2','2019-08-02 09:50:10.3','2019-08-02 09:50:10.3','2019-08-02 09:50:10.4','2019-08-02 09:50:10.5','2019-08-02 09:50:10.6','2019-08-02 09:50:10.6'],
'Object' : ['B','A','A','A','C','C','C','B','B'],
'1' : [1,3,5,7,9,11,13,15,17],
'2' : [0,1,4,6,8,10,12,14,16],
'Object2' : ['A','A',np.nan,'C','C','C','C','B','A'],
'Value' : ['X','X',np.nan,'Y','Y','Y','Y','Z',np.nan],
})
Expected output:
Time Object 1 2 Object2 Value A1 A2
0 2019-08-02 09:50:10.1 B 1 0 A X 3.0 1.0 # Match A-A at this time point, so output is 3,1
1 2019-08-02 09:50:10.1 A 3 1 A X 3.0 1.0 # Still at same time point so use 3,1
2 2019-08-02 09:50:10.2 A 5 4 NaN NaN NaN NaN # No Value so NaN
3 2019-08-02 09:50:10.3 A 7 6 C Y 9.0 8.0 # Match C-C at this time point, so output is 9,8
4 2019-08-02 09:50:10.3 C 9 8 C Y 9.0 8.0 # Still at same time point so use 9,8
5 2019-08-02 09:50:10.4 C 11 10 C Y 11.0 10.0 # Match C-C at this time point, so output is 11,10
6 2019-08-02 09:50:10.5 C 13 12 C Y 13.0 12.0 # Match C-C at this time point, so output is 13,12
7 2019-08-02 09:50:10.6 B 15 14 B Z 0.0 0.0 # Z so 0,0
8 2019-08-02 09:50:10.6 B 17 16 A NaN NaN NaN # No Value so NaN
Use DataFrame.where + Series.eq to create a DataFrame like df[['1','2']] that keeps only the rows where matches is True and has NaN everywhere else. Then use DataFrame.groupby to group by time point and fill the missing data in each group with the existing values from the rows where Object and Object2 coincide (matches == True). Use DataFrame.where to discard the values where df['Value'] is NaN. Finally, use DataFrame.mask to set 0 where the Value column is Z.
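# rows where Object equals Object2
matches = df['Object'].eq(df['Object2'])
# conditions
condition_z = df['Value'] == 'Z'
not_null = df['Value'].notnull()
# DataFrame used to fill A1/A2: matched values spread across their time group
# group_keys=False keeps the original index so the later steps align row by row
df12 = (df[['1', '2']].where(matches)
        .groupby(df['Time'], sort=False, group_keys=False)
        .apply(lambda x: x.ffill().bfill()))
# discard rows where Value is NaN, then set 0 where Value is Z
df[['A1', 'A2']] = df12.where(not_null).mask(condition_z, 0)
print(df)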
Output
Time Object 1 2 Object2 Value A1 A2
0 2019-08-02 09:50:10.1 B 1 0 A X 3.0 1.0
1 2019-08-02 09:50:10.1 A 3 1 A X 3.0 1.0
2 2019-08-02 09:50:10.2 A 5 4 NaN NaN NaN NaN
3 2019-08-02 09:50:10.3 A 7 6 C Y 9.0 8.0
4 2019-08-02 09:50:10.3 C 9 8 C Y 9.0 8.0
5 2019-08-02 09:50:10.4 C 11 10 C Y 11.0 10.0
6 2019-08-02 09:50:10.5 C 13 12 C Y 13.0 12.0
7 2019-08-02 09:50:10.6 B 15 14 B Z 0.0 0.0
8 2019-08-02 09:50:10.6 B 17 16 A NaN NaN NaN
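To see what each step of this answer contributes, it can help to look at the intermediate frames. The sketch below rebuilds the new sample df from the question and prints the masked and the filled frame; the names masked and filled are introduced here for illustration and do not appear in the answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Time": ["2019-08-02 09:50:10.1", "2019-08-02 09:50:10.1",
             "2019-08-02 09:50:10.2", "2019-08-02 09:50:10.3",
             "2019-08-02 09:50:10.3", "2019-08-02 09:50:10.4",
             "2019-08-02 09:50:10.5", "2019-08-02 09:50:10.6",
             "2019-08-02 09:50:10.6"],
    "Object": ["B", "A", "A", "A", "C", "C", "C", "B", "B"],
    "1": [1, 3, 5, 7, 9, 11, 13, 15, 17],
    "2": [0, 1, 4, 6, 8, 10, 12, 14, 16],
    "Object2": ["A", "A", np.nan, "C", "C", "C", "C", "B", "A"],
    "Value": ["X", "X", np.nan, "Y", "Y", "Y", "Y", "Z", np.nan],
})

# Step 1: keep '1'/'2' only on rows where Object equals Object2
matches = df["Object"].eq(df["Object2"])
masked = df[["1", "2"]].where(matches)
print(masked)

# Step 2: spread the surviving values over every row of the same time point
filled = (masked
          .groupby(df["Time"], sort=False, group_keys=False)
          .apply(lambda g: g.ffill().bfill()))
print(filled)
```

At this stage the Z row and the rows without a Value still carry looked-up numbers; the final where/mask step of the answer is what turns them into 0 and NaN.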
We can also use GroupBy.transform:
# rows where Object equals Object2
matches = df['Object'].eq(df['Object2'])
# conditions
condition_z = df['Value'] == 'Z'
not_null = df['Value'].notnull()
# DataFrame used to fill A1/A2: first matched (non-null) value of each time group
df12 = (df[['1', '2']].where(matches)
        .groupby(df['Time'], sort=False)
        .transform('first'))
# discard rows where Value is NaN, then set 0 where Value is Z
df[['A1', 'A2']] = df12.where(not_null).mask(condition_z, 0)
print(df)
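The two variants agree on the sample data because each time point there has at most one matching row. They can differ when a group contains several matches: ffill().bfill() gives each row the nearest preceding match, while transform('first') gives every row the first match. A small sketch on a made-up single-group Series (not data from the question):

```python
import numpy as np
import pandas as pd

# One time group with two matched values (3 and 5); NaN marks non-matching rows
s = pd.Series([np.nan, 3.0, np.nan, 5.0])
keys = pd.Series(["t1", "t1", "t1", "t1"])

via_fill = s.groupby(keys).transform(lambda g: g.ffill().bfill())
via_first = s.groupby(keys).transform("first")

print(via_fill.tolist())    # [3.0, 3.0, 3.0, 5.0]
print(via_first.tolist())   # [3.0, 3.0, 3.0, 3.0]
```

Which behaviour is wanted depends on what several matches within one time point should mean, which the question does not specify.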
If there are only a few conditions, use DataFrame.loc to assign by condition:
m1 = df['Value'].isin(['X','Y'])
m2 = df['Value'] == 'Z'
df[['A1','A2']] = df.loc[m1, ['1','2']]
df.loc[m2, ['A1','A2']] = 0
print(df)
Time Object 1 2 Object2 Value A1 A2
0 2019-08-02 09:50:10.1 A 1 0 A X 1.0 0.0
1 2019-08-02 09:50:10.1 B 1 1 A X 1.0 1.0
2 2019-08-02 09:50:10.2 A 5 4 NaN NaN NaN NaN
3 2019-08-02 09:50:10.3 C 7 6 C Y 7.0 6.0
4 2019-08-02 09:50:10.3 A 9 8 C Y 9.0 8.0
5 2019-08-02 09:50:10.4 C 11 10 NaN NaN NaN NaN
6 2019-08-02 09:50:10.5 C 13 12 B NaN NaN NaN
7 2019-08-02 09:50:10.6 B 15 14 B Z 0.0 0.0
8 2019-08-02 09:50:10.6 B 17 16 B NaN NaN NaN
Another solution with numpy.select and broadcast masks:
m1 = df['Value'].isin(['X','Y'])
m2 = df['Value'] == 'Z'
masks = [m1.values[:, None], m2.values[:, None]]
values = [df[['1','2']].values, 0]
df[['A1','A2']] = pd.DataFrame(np.select(masks,values, default=np.nan), index=df.index)
print(df)
Time Object 1 2 Object2 Value A1 A2
0 2019-08-02 09:50:10.1 A 1 0 A X 1.0 0.0
1 2019-08-02 09:50:10.1 B 1 1 A X 1.0 1.0
2 2019-08-02 09:50:10.2 A 5 4 NaN NaN NaN NaN
3 2019-08-02 09:50:10.3 C 7 6 C Y 7.0 6.0
4 2019-08-02 09:50:10.3 A 9 8 C Y 9.0 8.0
5 2019-08-02 09:50:10.4 C 11 10 NaN NaN NaN NaN
6 2019-08-02 09:50:10.5 C 13 12 B NaN NaN NaN
7 2019-08-02 09:50:10.6 B 15 14 B Z 0.0 0.0
8 2019-08-02 09:50:10.6 B 17 16 B NaN NaN NaN
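Both variants above copy each row's own 1 and 2 values; they do not look up the row where Object equals Object2. If that lookup is needed, one possible way is to feed numpy.select a per-time-point lookup table instead of the raw columns. This is only a sketch that combines ideas from different answers on this page, and the name lookup is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Time": ["2019-08-02 09:50:10.1", "2019-08-02 09:50:10.1",
             "2019-08-02 09:50:10.2", "2019-08-02 09:50:10.3",
             "2019-08-02 09:50:10.3", "2019-08-02 09:50:10.4",
             "2019-08-02 09:50:10.5", "2019-08-02 09:50:10.6",
             "2019-08-02 09:50:10.6"],
    "Object": ["B", "A", "A", "A", "C", "C", "C", "B", "B"],
    "1": [1, 3, 5, 7, 9, 11, 13, 15, 17],
    "2": [0, 1, 4, 6, 8, 10, 12, 14, 16],
    "Object2": ["A", "A", np.nan, "C", "C", "C", "C", "B", "A"],
    "Value": ["X", "X", np.nan, "Y", "Y", "Y", "Y", "Z", np.nan],
})

# Values of the matching row (Object == Object2), repeated over its time point
lookup = (df[["1", "2"]]
          .where(df["Object"].eq(df["Object2"]))
          .groupby(df["Time"], sort=False)
          .transform("first"))

# Column-shaped masks (n, 1) broadcast against the (n, 2) lookup values
m1 = df["Value"].isin(["X", "Y"]).to_numpy()[:, None]
m2 = df["Value"].eq("Z").to_numpy()[:, None]
result = np.select([m1, m2], [lookup.to_numpy(), 0], default=np.nan)

df["A1"] = result[:, 0]
df["A2"] = result[:, 1]
print(df)
```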
df['A1'] = df.apply(lambda row: row['1'] if row['Value'] == 'X' else np.nan, axis=1)
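The one-liner above only covers Value == 'X' and only fills A1. A row-wise function could cover X, Y and Z for both columns, as in the sketch below. The helper row_rule and the toy frame are assumptions made for illustration; like the one-liner, this copies each row's own values, does not match Object against Object2, and is likely to be slower than the mask-based solutions on large frames.

```python
import numpy as np
import pandas as pd

# Toy data (an assumption for illustration only)
toy = pd.DataFrame({
    "1": [1, 3, 5, 7],
    "2": [0, 1, 4, 6],
    "Value": ["X", "Y", "Z", np.nan],
})

def row_rule(row):
    """Hypothetical helper: return the A1/A2 pair for a single row."""
    if row["Value"] in ("X", "Y"):
        return pd.Series({"A1": row["1"], "A2": row["2"]})
    if row["Value"] == "Z":
        return pd.Series({"A1": 0, "A2": 0})
    return pd.Series({"A1": np.nan, "A2": np.nan})

# axis=1 calls row_rule once per row; the returned Series become columns
out = toy.apply(row_rule, axis=1)
print(out)
```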
I had to make a few adjustments to your dataframe, because it did not match the expected result in your question.
df = pd.DataFrame(
{
"Time": [
"2019-08-02 09:50:10.1",
"2019-08-02 09:50:10.1",
"2019-08-02 09:50:10.2",
"2019-08-02 09:50:10.3",
"2019-08-02 09:50:10.3",
"2019-08-02 09:50:10.4",
"2019-08-02 09:50:10.5",
"2019-08-02 09:50:10.6",
"2019-08-02 09:50:10.6",
],
"Object": ["A", "B", "A", "C", "A", "C", "C", "B", "B"],
"1": [1, 1, 5, 7, 9, 11, 13, 15, 17],
"2": [0, 1, 4, 6, 8, 10, 12, 14, 16],
"Object2": ["A", "A", np.nan, "C", "C", "C", "C", "B", "A"],
"Value": ["X", "X", np.nan, "Y", "Y", "Y", "Y", "Z", np.nan],
}
)
This is a vectorized solution that should perform well on large data.
The first step is to make sure the dataframe is sorted by time.
df = df.sort_values("Time", kind="mergesort")  # stable sort keeps the original order of rows that share a time point
Copy columns 1 and 2.
df["A1"] = df["1"]
df["A2"] = df["2"]
The index values will be used to get the first row of each time group.
df = df.reset_index()
I am not very happy with the list/isin solution. Curious whether anyone knows a less hacky way to do this?
li = df.groupby("Time")["index"].first().tolist()
print(li)
[0, 2, 3, 5, 6, 7]
print(df)
index Time Object 1 2 Object2 Value A1 A2
0 0 2019-08-02 09:50:10.1 A 1 0 A X 1 0
1 1 2019-08-02 09:50:10.1 B 1 1 A X 1 1
2 2 2019-08-02 09:50:10.2 A 5 4 NaN NaN 5 4
3 3 2019-08-02 09:50:10.3 C 7 6 C Y 7 6
4 4 2019-08-02 09:50:10.3 A 9 8 C Y 9 8
5 5 2019-08-02 09:50:10.4 C 11 10 C Y 11 10
6 6 2019-08-02 09:50:10.5 C 13 12 C Y 13 12
7 7 2019-08-02 09:50:10.6 B 15 14 B Z 15 14
8 8 2019-08-02 09:50:10.6 B 17 16 A NaN 17 16
Filter the dataframe to get every row except those in the list, then set them to np.nan.
df.loc[~df.index.isin(li), ["A1", "A2"]] = np.nan
Forward-fill the first-row values.
df[["A1", "A2"]] = df[["A1", "A2"]].ffill(axis=0)
Set the Z rows to 0 and the rows with a missing Value to np.nan.
df.loc[df["Value"] == "Z", ["A1", "A2"]] = 0
df.loc[df["Value"].isnull(), ["A1", "A2"]] = np.nan
Drop the index column.
df = df.drop("index", axis=1)
print(df)
Time Object 1 2 Object2 Value A1 A2
0 2019-08-02 09:50:10.1 A 1 0 A X 1.0 0.0
1 2019-08-02 09:50:10.1 B 1 1 A X 1.0 0.0
2 2019-08-02 09:50:10.2 A 5 4 NaN NaN NaN NaN
3 2019-08-02 09:50:10.3 C 7 6 C Y 7.0 6.0
4 2019-08-02 09:50:10.3 A 9 8 C Y 7.0 6.0
5 2019-08-02 09:50:10.4 C 11 10 C Y 11.0 10.0
6 2019-08-02 09:50:10.5 C 13 12 C Y 13.0 12.0
7 2019-08-02 09:50:10.6 B 15 14 B Z 0.0 0.0
8 2019-08-02 09:50:10.6 B 17 16 A NaN NaN NaN
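On the question of a less hacky alternative to the list/isin step: one option might be Series.duplicated, which flags every row after the first occurrence of a time point, so its negation marks the first row of each group without resetting the index. The sketch below assumes the adjusted dataframe from this answer; the names first_in_group and first_vals are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Time": ["2019-08-02 09:50:10.1", "2019-08-02 09:50:10.1",
             "2019-08-02 09:50:10.2", "2019-08-02 09:50:10.3",
             "2019-08-02 09:50:10.3", "2019-08-02 09:50:10.4",
             "2019-08-02 09:50:10.5", "2019-08-02 09:50:10.6",
             "2019-08-02 09:50:10.6"],
    "Object": ["A", "B", "A", "C", "A", "C", "C", "B", "B"],
    "1": [1, 1, 5, 7, 9, 11, 13, 15, 17],
    "2": [0, 1, 4, 6, 8, 10, 12, 14, 16],
    "Object2": ["A", "A", np.nan, "C", "C", "C", "C", "B", "A"],
    "Value": ["X", "X", np.nan, "Y", "Y", "Y", "Y", "Z", np.nan],
})

# Stable sort so rows sharing a time point keep their original order
df = df.sort_values("Time", kind="mergesort")

# True on the first row of every time point, False on the repeats
first_in_group = ~df["Time"].duplicated()

# Keep first-row values, forward-fill them over the rest of each group
first_vals = df[["1", "2"]].where(first_in_group).ffill()

# Z rows become 0, rows without a Value become NaN
first_vals = first_vals.mask(df["Value"].eq("Z"), 0).mask(df["Value"].isna())

df["A1"] = first_vals["1"]
df["A2"] = first_vals["2"]
print(df)
```

As in the answer itself, this takes the first row of each time point rather than the row where Object equals Object2, so it only reproduces the expected output when the matching row comes first.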