Pandas/Python: How to create new column based on values from other columns and apply extra condition to this new column
I have a pandas dataframe and I want to create a new column BB based on the following condition: BB should be 0 where TGR1 is 0; otherwise it should take the value from the column (1, 2 or 3) that TGR1 names.
I was able to achieve the first step using:
df.loc[df['TGR1'] == 0, 'BB'] = 0
I also tried to use np.where, but I can't figure out the right way to go about this:
df['BB'] = np.where(df.TGR1 == 0,0, df.columns == test.TGR1.value )
Dist Track EVENT_ID Date 1 2 3 TGR1 TGR2
311m Cran 174331755 2020-10-19 34.00 5.18 19.10 1 0
311m Cran 174331755 2020-10-19 34.00 5.18 19.10 2 1
311m Cran 174331755 2020-10-19 34.00 5.18 19.10 0 2
311m Cran 174331755 2020-10-19 34.00 5.18 19.10 3 1
311m Cran 174331755 2020-10-19 34.00 5.18 19.10 2 2
311m Cran 174331755 2020-10-19 34.00 5.18 19.10 1 2
Expected Output:
Dist Track EVENT_ID Date 1 2 3 TGR1 TGR2 BB
311m Cran 174331755 2020-10-19 34.00 5.18 19.10 1 0 34.00
311m Cran 174331755 2020-10-19 34.00 5.18 19.10 2 1 5.18
311m Cran 174331755 2020-10-19 34.00 5.18 19.10 0 2 0
311m Cran 174331755 2020-10-19 34.00 5.18 19.10 3 1 19.10
One way is to use numpy advanced indexing:
import numpy as np
# extract columns 1,2,3 into a numpy array with a zeros column stacked on the left
vals = np.column_stack((np.zeros(len(df)), df[list('123')]))
vals
array([[ 0. , 34. , 5.18, 19.1 ],
[ 0. , 34. , 5.18, 19.1 ],
[ 0. , 34. , 5.18, 19.1 ],
[ 0. , 34. , 5.18, 19.1 ],
[ 0. , 34. , 5.18, 19.1 ],
[ 0. , 34. , 5.18, 19.1 ]])
# use TGR1 values as the column index to extract corresponding values
df['BB'] = vals[np.arange(len(df)), df.TGR1.values]
df
Dist Track EVENT_ID Date 1 2 3 TGR1 TGR2 BB
0 311m Cran 174331755 2020-10-19 34.0 5.18 19.1 1 0 34.00
1 311m Cran 174331755 2020-10-19 34.0 5.18 19.1 2 1 5.18
2 311m Cran 174331755 2020-10-19 34.0 5.18 19.1 0 2 0.00
3 311m Cran 174331755 2020-10-19 34.0 5.18 19.1 3 1 19.10
4 311m Cran 174331755 2020-10-19 34.0 5.18 19.1 2 2 5.18
5 311m Cran 174331755 2020-10-19 34.0 5.18 19.1 1 2 34.00
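To try the approach above without the original data, here is a minimal self-contained sketch. The frame `sample` is an assumed, trimmed reconstruction of the question's data (only columns 1, 2, 3 and TGR1), not the asker's actual file.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the question's data, keeping only the needed columns
sample = pd.DataFrame({
    "1": [34.00] * 6,
    "2": [5.18] * 6,
    "3": [19.10] * 6,
    "TGR1": [1, 2, 0, 3, 2, 1],
})

# Column 0 is all zeros; columns 1..3 mirror the frame's "1", "2", "3"
vals = np.column_stack((np.zeros(len(sample)), sample[["1", "2", "3"]].to_numpy()))

# Pair each row number with its TGR1 value to pick exactly one element per row
sample["BB"] = vals[np.arange(len(sample)), sample["TGR1"].to_numpy()]
print(sample["BB"].tolist())  # [34.0, 5.18, 0.0, 19.1, 5.18, 34.0]
```

The zeros column is what makes TGR1 == 0 map to 0, so no separate condition is needed.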
Here you can try a numpy trick, as in this answer.
We first define a matrix with the values from columns 1, 2 and 3, and add a first column of zeros.
import pandas as pd
import numpy as np
# we first define a matrix
# with len(df) rows and 4 columns
mat = np.zeros((len(df), 4))
# Then we fill the last 3 columns
# with values from df
mat[:,1:] = df[["1", "2", "3"]].values
# Then a vector with values from df["TGR1"]
v = df["TGR1"].values
# Finally we take the given index
# from each row on matrix
# take_along_axis returns shape (len(df), 1), so flatten it before assigning
df["BB"] = np.take_along_axis(mat, v[:, None], axis=1).ravel()
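A small sketch of what `np.take_along_axis` does here may help. The arrays `mat` and `v` below are made-up stand-ins for the matrix and TGR1 vector described above. The result keeps the same number of dimensions as the index array, so it comes back with shape (n, 1).

```python
import numpy as np

# Stand-in data: a zeros column followed by the values of columns 1, 2, 3
mat = np.array([[0.0, 34.0, 5.18, 19.1],
                [0.0, 34.0, 5.18, 19.1],
                [0.0, 34.0, 5.18, 19.1]])
v = np.array([1, 0, 3])  # one column position per row

# v[:, None] has shape (3, 1): one index per row along axis 1
picked = np.take_along_axis(mat, v[:, None], axis=1)
print(picked.shape)   # (3, 1)

flat = picked.ravel()  # 1-D view, convenient for a DataFrame column
print(flat.tolist())  # [34.0, 0.0, 19.1]
```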
I compared the timing for some of the answers here. I took a df 10,000 times larger than the original one:
df_bk = pd.concat([df for i in range(10_000)], ignore_index=True)
and before running each test I reset it with df = df_bk.copy().
CPU times: user 430 ms, sys: 12.1 ms, total: 442 ms
Wall time: 452 ms
CPU times: user 746 ms, sys: 0 ns, total: 746 ms
Wall time: 746 ms
CPU times: user 5.54 ms, sys: 0 ns, total: 5.54 ms
Wall time: 4.84 ms
CPU times: user 5.93 ms, sys: 141 µs, total: 6.07 ms
Wall time: 5.61 ms
Psidom's solution and mine have basically the same timing. Here is a plot.
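For anyone who wants to repeat a comparison like this, here is a rough sketch using the standard `timeit` module. The frame `base` and the helper names `fancy_index` and `take_along` are assumptions made for illustration; the numbers will differ by machine, so none are shown.

```python
import timeit

import numpy as np
import pandas as pd

# Assumed stand-in for the question's data
base = pd.DataFrame({
    "1": [34.00] * 6,
    "2": [5.18] * 6,
    "3": [19.10] * 6,
    "TGR1": [1, 2, 0, 3, 2, 1],
})
big = pd.concat([base] * 10_000, ignore_index=True)


def fancy_index(frame):
    """Advanced indexing with a zeros column stacked on the left."""
    vals = np.column_stack((np.zeros(len(frame)), frame[["1", "2", "3"]].to_numpy()))
    return vals[np.arange(len(frame)), frame["TGR1"].to_numpy()]


def take_along(frame):
    """Same lookup expressed with np.take_along_axis."""
    mat = np.zeros((len(frame), 4))
    mat[:, 1:] = frame[["1", "2", "3"]].to_numpy()
    idx = frame["TGR1"].to_numpy()[:, None]
    return np.take_along_axis(mat, idx, axis=1).ravel()


for fn in (fancy_index, take_along):
    seconds = timeit.timeit(lambda: fn(big), number=10)
    print(f"{fn.__name__}: {seconds / 10 * 1e3:.2f} ms per call")
```

Both helpers return the new column instead of assigning it, so the frame does not need to be copied between runs.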
You can create the column using a list comprehension with your if-else logic:
import random
import pandas as pd

# Sample data
df = pd.DataFrame({'TGR1': [random.randint(0, 3) for i in range(10)],
                   '1': [random.randint(0, 100) for i in range(10)],
                   '2': [random.randint(101, 200) for i in range(10)],
                   '3': [random.randint(201, 300) for i in range(10)]})
# creating the column (enumerate gives row labels only because df has the default 0..n-1 index)
df['BB'] = [0 if tgr1_val == 0 else df.loc[ind, str(tgr1_val)]
            for ind, tgr1_val in enumerate(df['TGR1'].values)]
df
# TGR1 1 2 3 BB
# 0 0 54 107 217 0
# 1 2 71 128 277 128
# 2 1 25 103 269 25
# 3 0 80 112 279 0
# 4 2 98 167 228 167
# 5 3 26 192 285 285
# 6 0 27 107 228 0
# 7 2 13 103 298 103
# 8 3 28 196 289 289
# 9 2 72 186 251 186
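The comprehension above looks rows up by label with `df.loc[ind, ...]`, which matches `enumerate` only when the frame has the default 0..n-1 index. A variant that iterates over row records avoids that dependency. This is a sketch on made-up data; `frame` and `rows` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame(
    {"TGR1": [0, 2, 3], "1": [54, 71, 26], "2": [107, 128, 192], "3": [217, 277, 285]},
    index=[10, 20, 30],  # deliberately not 0..n-1
)

# Each record is a dict of column name -> value for one row
rows = frame.to_dict("records")
frame["BB"] = [0 if row["TGR1"] == 0 else row[str(row["TGR1"])] for row in rows]
print(frame["BB"].tolist())  # [0, 128, 285]
```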
This is easily done with a boolean mask, as you did in your first step:
df.loc[df['TGR1'] == 0, 'BB'] = 0
For the other values, greater than 0:
df.loc[df['TGR1'] == 1, 'BB'] = df.loc[df['TGR1'] == 1, '1']
df.loc[df['TGR1'] == 2, 'BB'] = df.loc[df['TGR1'] == 2, '2']
df.loc[df['TGR1'] == 3, 'BB'] = df.loc[df['TGR1'] == 3, '3']
output:
1 2 3 TGR1 BB
0 34.0 5.18 19.1 1 34.00
1 34.0 5.18 19.1 2 5.18
2 34.0 5.18 19.1 0 0.00
3 34.0 5.18 19.1 3 19.10
4 34.0 5.18 19.1 2 5.18
It is probably quite readable, too.
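The same mask-per-value idea can be written in one step with `np.select`, which takes a list of conditions and a matching list of choices. This is a swapped-in alternative, not the answer's own method, and the frame below is an assumed stand-in for the question's data.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({
    "1": [34.0] * 5,
    "2": [5.18] * 5,
    "3": [19.1] * 5,
    "TGR1": [1, 2, 0, 3, 2],
})

keys = (1, 2, 3)
# One boolean mask and one source column per TGR1 value
conditions = [(frame["TGR1"] == k).to_numpy() for k in keys]
choices = [frame[str(k)].to_numpy() for k in keys]

# Rows matching no condition (TGR1 == 0) fall back to the default
frame["BB"] = np.select(conditions, choices, default=0)
print(frame["BB"].tolist())  # [34.0, 5.18, 0.0, 19.1, 5.18]
```

Adding another source column only means extending `keys`, rather than writing one more assignment line.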
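Drop TGR2 temporarily, then look up the columns named by TGR1; that should do it. Code below:
# Cast to strings, drop TGR2, and keep only columns whose names contain a digit (1, 2, 3, TGR1)
s = df.astype(str).drop(columns='TGR2').filter(regex=r'\d', axis=1).reset_index()
# Position of the column named by each TGR1 value; '0' matches no column and gives -1,
# which selects the last column (TGR1 itself, holding '0')
i = s.astype(str).columns.get_indexer(s.TGR1)
# Note that BB ends up holding strings, since s was cast with astype(str)
df['BB'] = s.values[s.index, i]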
Dist Track EVENT_ID Date 1 2 3 TGR1 TGR2 BB
0 311m Cran 174331755 2020-10-19 34.0 5.18 19.1 1 0 34.0
1 311m Cran 174331755 2020-10-19 34.0 5.18 19.1 2 1 5.18
2 311m Cran 174331755 2020-10-19 34.0 5.18 19.1 0 2 0
3 311m Cran 174331755 2020-10-19 34.0 5.18 19.1 3 1 19.1
4 311m Cran 174331755 2020-10-19 34.0 5.18 19.1 2 2 5.18
5 311m Cran 174331755 2020-10-19 34.0 5.18 19.1 1 2 34.0
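Because the lookup above runs on a frame cast with `astype(str)`, the resulting BB values are strings (note the `0` and `34.0` in the output). If numbers are needed afterwards, `pd.to_numeric` can convert them. A minimal sketch on stand-in values; `bb_as_text` is an illustrative name:

```python
import pandas as pd

# Stand-in for the string-valued BB column produced by the lookup
bb_as_text = pd.Series(["34.0", "5.18", "0", "19.1"])

bb_numeric = pd.to_numeric(bb_as_text)
print(bb_numeric.dtype)     # float64
print(bb_numeric.tolist())  # [34.0, 5.18, 0.0, 19.1]
```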