Transferring values between two columns in a Pandas data frame

Question

I have a pandas data frame like this:

p q
0.5 0.5
0.6 0.4
0.3 0.7
0.4 0.6
0.9 0.1

I want to know how I can move the larger value of each row into column p and the smaller value into column q, like this:

p q
0.5 0.5
0.6 0.4
0.7 0.3
0.6 0.4
0.9 0.1

Answer 1

You could compute conditional arrays with np.where() and then assign them back to the dataframe:

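import numpy as np
import pandas as pd

df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9], 'q': [0.5, 0.4, 0.7, 0.6, 0.1]})
s1 = np.where(df['p'] < df['q'], df['q'], df['p'])  # larger value of each row
s2 = np.where(df['p'] > df['q'], df['q'], df['p'])  # smaller value of each row
df['p'] = s1
df['q'] = s2
df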
Out[1]: 
     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1

You could also use .where():

s1 = df['p'].where(df['p'] > df['q'], df['q'])
s2 = df['p'].where(df['p'] < df['q'], df['q'])
df['p'] = s1
df['q'] = s2
df
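
The comparisons point in opposite directions in the two snippets above because the two functions read differently: np.where(cond, a, b) picks a where cond is true, while Series.where(cond, other) keeps the original value where cond is true and substitutes other elsewhere. A minimal sketch on the question's data (the names larger_np and larger_pd are only illustrative, not from the answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})

# np.where(cond, a, b): take a where cond is True, otherwise b.
larger_np = np.where(df['p'] < df['q'], df['q'], df['p'])

# Series.where(cond, other): keep df['p'] where cond is True, otherwise use other.
larger_pd = df['p'].where(df['p'] > df['q'], df['q'])

# Both hold the row-wise larger value.
print(larger_np.tolist())  # [0.5, 0.6, 0.7, 0.6, 0.9]
print(larger_pd.tolist())  # [0.5, 0.6, 0.7, 0.6, 0.9]
```

Note that np.where() returns a plain NumPy array while .where() returns a Series; either can be assigned back to a column here.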

I tested execution times on dataframes ranging from 100 rows to 1 million rows, and the answers that require passing axis=1 can be 10,000 times slower:

Erfan's numpy answer looks to be the fastest, executing in milliseconds for large datasets.
My .where() answer also performs well, keeping execution time in milliseconds (I assume np.where() would have a similar outcome).
I thought MHDG7's answer would be the slowest, but it is actually faster than Alexander's answer.
I guess Alexander's answer is slow because it requires passing axis=1. Because MHDG7's and Alexander's answers both work row-wise (with axis=1), they can slow things down tremendously for large dataframes.

As you can see, a million-row dataframe took minutes to execute. With a dataframe of 10 million to 100 million rows, these one-liners could take hours to execute.

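from timeit import timeit

import numpy as np
import pandas as pd

# Base frame from the question (5 rows); the loop below tiles it to larger sizes.
df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9], 'q': [0.5, 0.4, 0.7, 0.6, 0.1]})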

def df_where(df):
    s1 = df['p'].where(df['p'] > df['q'], df['q'])
    s2 = df['p'].where(df['p'] < df['q'], df['q'])
    df['p'] = s1
    df['q'] = s2
    return df


def agg_maxmin(df):
    df[['p', 'q']] = df[['p', 'q']].agg([max, min], axis=1)
    return df


def np_flip(df):
    df = pd.DataFrame(np.flip(np.sort(df), axis=1), columns=df.columns)
    return df


def lambda_x(df):
    df = df.apply(lambda x: [x['p'],x['q']] if x['p']>x['q'] else [x['q'],x['p']],axis=1,result_type='expand')
    return df


res = pd.DataFrame(
    index=[20, 200, 2000, 20000, 200000],
    columns='df_where agg_maxmin np_flip lambda_x'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df]*i)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        print(stmt, d.shape)
        res.at[i, j] = timeit(stmt, setp, number=1)

res.plot(loglog=True);

Answer 2

Use numpy.sort to sort each row in ascending order (along the horizontal axis), then flip the array along axis=1:

df = pd.DataFrame(np.flip(np.sort(df), axis=1), columns=df.columns)

     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1
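
To make the two steps visible, here is a small sketch that splits the one-liner into its parts. The intermediate names ascending and descending are illustrative, and passing index=df.index is an addition of this sketch to preserve the original index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})

# np.sort sorts along the last axis by default, so each row ends up ascending.
ascending = np.sort(df.to_numpy())

# Reversing the column order turns each row into descending order.
descending = np.flip(ascending, axis=1)

result = pd.DataFrame(descending, columns=df.columns, index=df.index)
print(result)
```

This sorts every column of the frame together, so on a wider dataframe you would likely want to select df[['p', 'q']] first.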

Answer 3

Use agg, pass a list of functions (max and min), and specify axis=1 so that those functions are applied across the columns of each row.

df[['p', 'q']] = df[['p', 'q']].agg([max, min], axis=1)

>>> df
     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1

Simple solutions are not always the most performant (e.g. the one above). The following solution is significantly faster. It builds a mask of the rows where column p is less than column q, and then swaps the values in those rows.

mask = df['p'].lt(df['q'])
df.loc[mask, ['p', 'q']] = df.loc[mask, ['q', 'p']].to_numpy()
>>> df
     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1
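
The .to_numpy() call matters here. When a DataFrame is assigned through .loc, pandas aligns it on column labels before writing, so the reordered columns would be matched back to their original positions. A short sketch of the difference, assuming the question's data (aligned and swapped are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})
mask = df['p'].lt(df['q'])

# Assigning a DataFrame: pandas aligns on column labels first,
# so 'p' is written back into 'p' and 'q' into 'q' (nothing moves).
aligned = df.copy()
aligned.loc[mask, ['p', 'q']] = aligned.loc[mask, ['q', 'p']]

# Assigning a bare array: values are placed by position, so the swap happens.
swapped = df.copy()
swapped.loc[mask, ['p', 'q']] = swapped.loc[mask, ['q', 'p']].to_numpy()

print(aligned.equals(df))      # True: the frame is unchanged
print(swapped['p'].tolist())   # [0.5, 0.6, 0.7, 0.6, 0.9]
```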

Answer 4

You can use the apply function:

df[['p','q']] = df.apply(lambda x: [x['p'],x['q']] if x['p']>x['q'] else [x['q'],x['p']],axis=1,result_type='expand' )
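
As a rough illustration of what result_type='expand' contributes, the sketch below runs the same row-wise logic with and without it. The helper name larger_first is hypothetical and only stands in for the lambda above:

```python
import pandas as pd

df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})

def larger_first(row):
    # Return the pair of values with the larger one first.
    return [row['p'], row['q']] if row['p'] > row['q'] else [row['q'], row['p']]

# Without result_type, each row's list is stored as one object in a Series.
as_lists = df.apply(larger_first, axis=1)

# With result_type='expand', each list is spread over columns 0 and 1.
as_frame = df.apply(larger_first, axis=1, result_type='expand')

print(type(as_lists).__name__, type(as_frame).__name__)  # Series DataFrame
df[['p', 'q']] = as_frame
print(df)
```

Because the function is called once per row in Python, this is the kind of approach that the timing comparison in Answer 1 found slow on large frames.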

4 answers

Answer 1
3 2020-10-10 21:26:02

Answer 2
2 2020-10-10 21:26:23

Answer 3
2 2020-10-10 21:27:29

Answer 4
1 Accepted 2020-10-10 21:30:51
