Create new column based on values in other column
This is a column from my DataFrame:
Index Direction Output
10886 DOWN None
10887 UP None
10888 UP None
10889 UP None
10890 UP None
10891 UP STRONG_UP
10892 UP STRONG_UP
10893 UP STRONG_UP
10894 UP STRONG_UP
10895 UP STRONG_UP
10896 UP STRONG_UP
10897 UP STRONG_UP
10898 UP STRONG_UP
10899 UP STRONG_UP
10900 DOWN None
10901 DOWN None
10902 UP None
10903 UP None
10904 DOWN None
10905 DOWN None
10906 DOWN None
I want to create a new column.
If the current Direction value and the 5 previous Direction values == UP, the cell becomes 'STRONG_UP'.
If the current Direction value and the 5 previous Direction values == DOWN, the cell becomes 'STRONG_DOWN'.
Otherwise the value is 'None'.
How can I do it?
Unfortunately rolling works only with numbers, so the values are encoded and decoded with map, but this is slow for a large DataFrame:
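import numpy as np

def f(x):
    # x holds the encoded values of one window (1 = UP, 0 = DOWN)
    if np.all(x == 1):
        return 2
    elif np.all(x == 0):
        return 3
    else:
        return np.nan

# Parentheses let the method chain span several lines
df['Output'] = (df['Direction'].map({'UP': 1, 'DOWN': 0})
                               .rolling(6)
                               .apply(f)
                               .map({2: 'STRONG_UP', 3: 'STRONG_DOWN'}))
print(df)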
Index Direction Output
0 10887 UP NaN
1 10888 UP NaN
2 10889 UP NaN
3 10890 UP NaN
4 10891 UP NaN
5 10892 UP STRONG_UP
6 10893 UP STRONG_UP
7 10894 UP STRONG_UP
8 10895 UP STRONG_UP
9 10896 UP STRONG_UP
10 10897 UP STRONG_UP
11 10898 UP STRONG_UP
12 10899 UP STRONG_UP
13 10900 DOWN NaN
14 10901 DOWN NaN
15 10902 UP NaN
16 10903 UP NaN
17 10904 DOWN NaN
18 10905 DOWN NaN
19 10906 DOWN NaN
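Since rolling needs numbers anyway, the Python-level apply can be avoided by encoding each direction as a 0/1 flag and comparing the rolling sum with the window length. This is only a sketch of that variation, not the method used above; the function name label_runs_rolling_sum and the sample Series are made up for illustration.

```python
import pandas as pd

def label_runs_rolling_sum(direction: pd.Series, n: int = 6) -> pd.Series:
    """Label rows whose last n values (current row included) are all UP or all DOWN."""
    up = direction.eq('UP').astype(int)
    down = direction.eq('DOWN').astype(int)
    # A window of n rows is all-UP exactly when the sum of its 0/1 flags equals n.
    # The first n - 1 rows have an incomplete window (NaN sum), so they compare False.
    strong_up = up.rolling(n).sum().eq(n)
    strong_down = down.rolling(n).sum().eq(n)
    out = pd.Series([None] * len(direction), index=direction.index, dtype=object)
    out.loc[strong_up] = 'STRONG_UP'
    out.loc[strong_down] = 'STRONG_DOWN'
    return out

# Hypothetical sample: 7 x UP, 6 x DOWN, 2 x UP
sample = pd.Series(['UP'] * 7 + ['DOWN'] * 6 + ['UP'] * 2)
labels = label_runs_rolling_sum(sample)
print(labels)
```

With a real DataFrame this would be called as df['Output'] = label_runs_rolling_sum(df['Direction']).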
Another idea uses strides and numpy.select, if performance is important:
def rolling_window(a, window):
    # Build a 2D view in which each row is one window of length `window`
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

n = 6
# Pad with n - 1 leading None values so every row gets a full window
x = np.concatenate([[None] * (n-1), df['Direction'].to_numpy()])
a = rolling_window(x, n)
print(a)
[[None None None None None 'UP']
[None None None None 'UP' 'UP']
[None None None 'UP' 'UP' 'UP']
[None None 'UP' 'UP' 'UP' 'UP']
[None 'UP' 'UP' 'UP' 'UP' 'UP']
['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
['UP' 'UP' 'UP' 'UP' 'UP' 'UP']
['UP' 'UP' 'UP' 'UP' 'UP' 'DOWN']
['UP' 'UP' 'UP' 'UP' 'DOWN' 'DOWN']
['UP' 'UP' 'UP' 'DOWN' 'DOWN' 'DOWN']
['UP' 'UP' 'DOWN' 'DOWN' 'DOWN' 'UP']
['UP' 'DOWN' 'DOWN' 'DOWN' 'UP' 'UP']
['DOWN' 'DOWN' 'DOWN' 'UP' 'UP' 'DOWN']
['DOWN' 'DOWN' 'UP' 'UP' 'DOWN' 'DOWN']]
m1 = np.all(a == 'UP', axis=1)
m2 = np.all(a == 'DOWN', axis=1)
df['Output'] = np.select([m1, m2], ['STRONG_UP','STRONG_DOWN'], None)
print (df)
Index Direction Output
0 10887 UP None
1 10888 UP None
2 10889 UP None
3 10890 UP None
4 10891 UP None
5 10892 UP STRONG_UP
6 10893 UP STRONG_UP
7 10894 UP STRONG_UP
8 10895 UP STRONG_UP
9 10896 UP STRONG_UP
10 10897 UP STRONG_UP
11 10898 UP STRONG_UP
12 10899 UP STRONG_UP
13 10900 DOWN None
14 10901 DOWN None
15 10902 DOWN None
16 10903 UP None
17 10904 UP None
18 10905 DOWN None
19 10906 DOWN None
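On NumPy 1.20 or newer (the environment shown in this answer uses 1.19.1, so this is an assumption about the reader's setup), the hand-written stride arithmetic can be replaced by numpy.lib.stride_tricks.sliding_window_view, which builds the same kind of windowed view with bounds checking. The sketch below uses a made-up helper name, label_runs_windows, and plain boolean indexing instead of numpy.select.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # requires NumPy >= 1.20

def label_runs_windows(values, n=6):
    """Return an object array with STRONG_UP / STRONG_DOWN / None per row."""
    values = np.asarray(values, dtype=object)
    # Pad with n - 1 leading None values so that every row has a full window.
    padded = np.concatenate([np.full(n - 1, None, dtype=object), values])
    windows = sliding_window_view(padded, n)  # shape: (len(values), n)
    all_up = (windows == 'UP').all(axis=1)
    all_down = (windows == 'DOWN').all(axis=1)
    out = np.full(len(values), None, dtype=object)
    out[all_up] = 'STRONG_UP'
    out[all_down] = 'STRONG_DOWN'
    return out

# Hypothetical sample: 7 x UP, 6 x DOWN, 2 x UP
result = label_runs_windows(['UP'] * 7 + ['DOWN'] * 6 + ['UP'] * 2)
print(result)
```

With a DataFrame, the call would be df['Output'] = label_runs_windows(df['Direction'].to_numpy()).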
Performance: the first method was omitted because it is too slow.
print (pd.show_versions())
INSTALLED VERSIONS
------------------
commit : f2ca0a2665b2d169c97de87b8e778dbed86aea07
python : 3.8.5.final.0
python-bits : 64
OS : Windows
OS-release : 7
Version : 6.1.7601
machine : AMD64
processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en
LOCALE : Slovak_Slovakia.1250
pandas : 1.1.1
numpy : 1.19.1
import numpy as np
import pandas as pd
import perfplot

np.random.seed(123)

def GW(df):
    # Run-length approach: number the runs, then count the position inside each run
    df['group'] = np.r_[True, df.Direction.values[1:] != df.Direction.values[:-1]].cumsum()
    df['count'] = df.groupby('group').cumcount()+1
    df['result'] = np.where(df['count'] >= 6, 'STRONG_'+df.Direction, np.nan)
    df = (df[['Index','Direction','result']])
    return df

def ST(df):
    # Strides approach: compare every window of n values at once
    def rolling_window(a, window):
        shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
        strides = a.strides + (a.strides[-1],)
        return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

    n = 6
    x = np.concatenate([[None] * (n-1), df['Direction'].to_numpy()])
    a = rolling_window(x, n)
    m1 = np.all(a == 'UP', axis=1)
    m2 = np.all(a == 'DOWN', axis=1)
    df['Output2'] = np.select([m1, m2], ['STRONG_UP','STRONG_DOWN'], None)
    return df

def make_df(n):
    direction = np.random.choice(['UP','DOWN'], n)
    df = pd.DataFrame({
        'Index': np.arange(len(direction)),
        'Direction': direction
    })
    return df

perfplot.show(
    setup=make_df,
    kernels=[GW, ST],
    n_range=[2**k for k in range(5, 25)],
    logx=True,
    logy=True,
    equality_check=False,
    xlabel='len(df)')
An idea with numpy and no applied function:
import numpy as np
df['group'] = np.r_[True, df.Direction.values[1:] != df.Direction.values[:-1]].cumsum()
df['count'] = df.groupby('group').cumcount()+1
df['result'] = np.where(df['count'] >= 6, 'STRONG_'+df.Direction, np.nan)
print(df[['Index','Direction','result']])
Output
Index Direction result
0 10887 UP NaN
1 10888 UP NaN
2 10889 UP NaN
3 10890 UP NaN
4 10891 UP NaN
5 10892 UP STRONG_UP
6 10893 UP STRONG_UP
7 10894 UP STRONG_UP
8 10895 UP STRONG_UP
9 10896 UP STRONG_UP
10 10897 UP STRONG_UP
11 10898 UP STRONG_UP
12 10899 UP STRONG_UP
13 10900 DOWN NaN
14 10901 DOWN NaN
15 10902 UP NaN
16 10903 UP NaN
17 10904 DOWN NaN
18 10905 DOWN NaN
19 10906 DOWN NaN
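To make the intermediate columns of this run-length idea easier to follow, here is a small sketch on a made-up six-row frame. It detects run starts with shift and ne instead of the np.r_ comparison, and it uses a threshold of 3 instead of 6 purely to keep the demo short; both choices are assumptions for illustration.

```python
import pandas as pd

demo = pd.DataFrame({'Direction': ['UP', 'UP', 'UP', 'DOWN', 'DOWN', 'UP']})
n = 3  # shorter than the 6 used in the answer, only to keep the demo small

# A new run starts wherever the value differs from the previous row
# (the first row compares against NaN, so it always starts a run).
starts = demo['Direction'].ne(demo['Direction'].shift())
# The cumulative sum of run starts gives each run its own id: 1, 1, 1, 2, 2, 3
demo['group'] = starts.cumsum()
# Position inside the run, counted from 1: 1, 2, 3, 1, 2, 1
demo['count'] = demo.groupby('group').cumcount() + 1
# A row at position >= n ends a streak of at least n equal values;
# where() keeps the label on those rows and leaves NaN elsewhere.
demo['result'] = ('STRONG_' + demo['Direction']).where(demo['count'] >= n)
print(demo)
```

Only the third row gets a label here (STRONG_UP), because it is the only row that ends a streak of three equal values.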
Out of curiosity I ran a little benchmark on my laptop (i5-7200U, 8 GB RAM, in Jupyter Notebook).
Data was generated like this:
direction = np.random.choice(['UP','DOWN'], 100000)
df = pd.DataFrame({
    'Index': np.arange(len(direction)),
    'Direction': direction
})
Results
     N=1000             | N=10000            | N=100000
RA   32.7 ms ± 3.05 ms  | 271 ms ± 22.9 ms   | 2.35 s ± 60.1 ms
GW   6.33 ms ± 230 µs   | 10.2 ms ± 51.4 µs  | 63.8 ms ± 1.31 ms
NP   1.33 ms ± 32.5 µs  | 8.21 ms ± 555 µs   | 74.4 ms ± 2.73 ms