Transferring values between two columns in a pandas data frame
I have a pandas data frame like this:
p q
0.5 0.5
0.6 0.4
0.3 0.7
0.4 0.6
0.9 0.1
So I want to know how I can move the larger value in each row to column p and the smaller value to column q, like this:
p q
0.5 0.5
0.6 0.4
0.7 0.3
0.6 0.4
0.9 0.1
You could build two conditional arrays with np.where() and then assign them back to the dataframe:
s1 = np.where(df['p'] < df['q'], df['q'], df['p'])
s2 = np.where(df['p'] > df['q'], df['q'], df['p'])
df['p'] = s1
df['q'] = s2
df
Out[1]:
p q
0 0.5 0.5
1 0.6 0.4
2 0.7 0.3
3 0.6 0.4
4 0.9 0.1
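For anyone who wants to run this end to end, here is a minimal self-contained sketch. It assumes the five-row frame from the question, built by hand. The np.maximum()/np.minimum() variant is not part of the answer above; it is shown only as an equivalent element-wise formulation, and the names larger, smaller, alt_larger and alt_smaller are illustrative.

```python
import numpy as np
import pandas as pd

# Assumed input: the five rows shown in the question.
df = pd.DataFrame({"p": [0.5, 0.6, 0.3, 0.4, 0.9],
                   "q": [0.5, 0.4, 0.7, 0.6, 0.1]})

# np.where(condition, a, b) picks from a where the condition holds, else from b.
# Both arrays are computed before either column is overwritten.
larger = np.where(df["p"] < df["q"], df["q"], df["p"])
smaller = np.where(df["p"] > df["q"], df["q"], df["p"])

# Equivalent element-wise formulation (an alternative, not the answer's code).
alt_larger = np.maximum(df["p"].to_numpy(), df["q"].to_numpy())
alt_smaller = np.minimum(df["p"].to_numpy(), df["q"].to_numpy())

df["p"] = larger
df["q"] = smaller
print(df)
```

Computing both arrays first matters: if df['p'] were overwritten before the second np.where() call, the second comparison would run against the already-updated column.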
You could also use .where():
s1 = df['p'].where(df['p'] > df['q'], df['q'])
s2 = df['p'].where(df['p'] < df['q'], df['q'])
df['p'] = s1
df['q'] = s2
df
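A runnable sketch of the .where() version, assuming the same five-row frame from the question. Note that Series.where(cond, other) keeps the original value where cond is True and takes other elsewhere, which is the reverse of the argument order in np.where(). The result frame is an illustrative name, used here so the input is left untouched.

```python
import pandas as pd

# Assumed input: the five rows shown in the question.
df = pd.DataFrame({"p": [0.5, 0.6, 0.3, 0.4, 0.9],
                   "q": [0.5, 0.4, 0.7, 0.6, 0.1]})

# Keep p where p is the larger value, otherwise fall back to q.
larger = df["p"].where(df["p"] > df["q"], df["q"])
# Keep p where p is the smaller value, otherwise fall back to q.
smaller = df["p"].where(df["p"] < df["q"], df["q"])

# Build a new frame instead of overwriting df in place.
result = pd.DataFrame({"p": larger, "q": smaller})
print(result)
```

For ties (row 0, where p equals q) both conditions are False, so both columns take the value from q, which is the same number.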
I tested the execution times over varying sizes, from 100 rows to 1 million rows, and the answers that require passing axis=1 can be 10,000 times slower!
The .where() answer also has great performance and keeps the execution time in milliseconds (I assume np.where() would have a similar outcome). MGDG7's and Alexander's answers are row-wise (with axis=1), which means they can slow things down tremendously for large dataframes. In my timings, a million-row dataframe was taking minutes to execute. And if you had a 10 million to 100 million row dataframe, these one-liners could take hours to execute.
from timeit import timeit

import numpy as np
import pandas as pd

# Base frame from the question; it is tiled below to build larger inputs.
df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})

def df_where(df):
    s1 = df['p'].where(df['p'] > df['q'], df['q'])
    s2 = df['p'].where(df['p'] < df['q'], df['q'])
    df['p'] = s1
    df['q'] = s2
    return df

def agg_maxmin(df):
    df[['p', 'q']] = df[['p', 'q']].agg(['max', 'min'], axis=1)
    return df

def np_flip(df):
    return pd.DataFrame(np.flip(np.sort(df), axis=1), columns=df.columns)

def lambda_x(df):
    return df.apply(lambda x: [x['p'], x['q']] if x['p'] > x['q'] else [x['q'], x['p']],
                    axis=1, result_type='expand')

res = pd.DataFrame(
    index=[20, 200, 2000, 20000, 200000],
    columns='df_where agg_maxmin np_flip lambda_x'.split(),
    dtype=float
)

# Each index value is the number of copies of the 5-row frame (100 to 1,000,000 rows).
for i in res.index:
    d = pd.concat([df] * i)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        print(stmt, d.shape)
        res.at[i, j] = timeit(stmt, setp, number=1)

res.plot(loglog=True);
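Before comparing speed, it helps to confirm that the candidates actually agree. The following is a small sanity check, not part of the original benchmark: it assumes the five-row frame from the question, and the via_* function names, base, expected and results are all illustrative. Each function works on a copy so that one approach cannot affect the input of the next.

```python
import numpy as np
import pandas as pd

# Assumed input and expected output: the two tables shown in the question.
base = pd.DataFrame({"p": [0.5, 0.6, 0.3, 0.4, 0.9],
                     "q": [0.5, 0.4, 0.7, 0.6, 0.1]})
expected = np.array([[0.5, 0.5], [0.6, 0.4], [0.7, 0.3], [0.6, 0.4], [0.9, 0.1]])

def via_where(df):
    df = df.copy()
    s1 = df["p"].where(df["p"] > df["q"], df["q"])
    s2 = df["p"].where(df["p"] < df["q"], df["q"])
    df["p"], df["q"] = s1, s2
    return df

def via_sort_flip(df):
    # Sort each row ascending, then reverse the column order.
    return pd.DataFrame(np.flip(np.sort(df.to_numpy(), axis=1), axis=1),
                        columns=df.columns)

def via_agg(df):
    df = df.copy()
    df[["p", "q"]] = df[["p", "q"]].agg(["max", "min"], axis=1)
    return df

def via_apply(df):
    return df.apply(
        lambda x: [x["p"], x["q"]] if x["p"] > x["q"] else [x["q"], x["p"]],
        axis=1, result_type="expand")

def via_mask_swap(df):
    df = df.copy()
    mask = df["p"].lt(df["q"])
    df.loc[mask, ["p", "q"]] = df.loc[mask, ["q", "p"]].to_numpy()
    return df

candidates = (via_where, via_sort_flip, via_agg, via_apply, via_mask_swap)
results = {f.__name__: f(base).to_numpy() for f in candidates}
for name, values in results.items():
    print(name, np.array_equal(values, expected))
```

Exact equality is safe here because every approach only moves existing values around and performs no arithmetic on them.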
Use numpy.sort to sort each row in ascending order along the horizontal axis, then flip the array over axis=1:
df = pd.DataFrame(np.flip(np.sort(df), axis=1), columns=df.columns)
p q
0 0.5 0.5
1 0.6 0.4
2 0.7 0.3
3 0.6 0.4
4 0.9 0.1
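One thing to keep in mind with this approach is that it builds a brand-new frame from a NumPy array, so the original index is replaced by a default 0..n-1 index unless it is passed along. A minimal sketch under that assumption follows; the letter index and the explicit index=df.index argument are additions for illustration and are not in the answer above.

```python
import numpy as np
import pandas as pd

# Assumed input: the question's values, with a hypothetical non-default index.
df = pd.DataFrame({"p": [0.5, 0.6, 0.3, 0.4, 0.9],
                   "q": [0.5, 0.4, 0.7, 0.6, 0.1]},
                  index=list("abcde"))

# Sort each row ascending, then reverse the columns so the largest comes first.
values = np.flip(np.sort(df.to_numpy(), axis=1), axis=1)

# Pass the index explicitly so the row labels survive the round trip.
out = pd.DataFrame(values, columns=df.columns, index=df.index)
print(out)
```

Because the rows are fully sorted, the same two lines also work for more than two columns, placing values in descending order from left to right.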
Use agg, pass a list of functions (max and min), and specify axis=1 so that those functions are applied across the columns of each row.
df[['p', 'q']] = df[['p', 'q']].agg(['max', 'min'], axis=1)
>>> df
p q
0 0.5 0.5
1 0.6 0.4
2 0.7 0.3
3 0.6 0.4
4 0.9 0.1
Simple solutions are not always the most performant (e.g. the one above). The following solution is significantly faster. It masks the dataframe where column p is less than column q, and then swaps the values.
mask = df['p'].lt(df['q'])
df.loc[mask, ['p', 'q']] = df.loc[mask, ['q', 'p']].to_numpy()
>>> df
p q
0 0.5 0.5
1 0.6 0.4
2 0.7 0.3
3 0.6 0.4
4 0.9 0.1
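The .to_numpy() call is what makes the swap work. As far as I understand pandas' .loc assignment, a DataFrame on the right-hand side is aligned on column labels, so p would be filled from p and q from q, leaving the rows unchanged. The sketch below contrasts the two on the question's data; the names aligned and swapped are illustrative.

```python
import pandas as pd

# Assumed input: the five rows shown in the question.
df = pd.DataFrame({"p": [0.5, 0.6, 0.3, 0.4, 0.9],
                   "q": [0.5, 0.4, 0.7, 0.6, 0.1]})
mask = df["p"].lt(df["q"])

# Without .to_numpy(): the right-hand side is a DataFrame, so pandas matches
# columns by label and the selected rows end up with their original values.
aligned = df.copy()
aligned.loc[mask, ["p", "q"]] = aligned.loc[mask, ["q", "p"]]

# With .to_numpy(): the labels are dropped and the assignment is positional,
# so the first array column (old q) lands in p and the second (old p) in q.
swapped = df.copy()
swapped.loc[mask, ["p", "q"]] = swapped.loc[mask, ["q", "p"]].to_numpy()

print(aligned)
print(swapped)
```

Only the rows where the mask is True are touched, which is one plausible reason this approach stays fast when most rows are already in order.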
You can use the apply function:
df[['p','q']] = df.apply(lambda x: [x['p'],x['q']] if x['p']>x['q'] else [x['q'],x['p']],axis=1,result_type='expand' )