
Transferring values between two columns in a pandas data frame

I have a pandas data frame like this:

p q
0.5 0.5
0.6 0.4
0.3 0.7
0.4 0.6
0.9 0.1

So I want to know: how can I move the greater value of each row into the p column, and the smaller value into the q column, like this:

p q
0.5 0.5
0.6 0.4
0.7 0.3
0.6 0.4
0.9 0.1

You could store some conditional series with np.where() and then apply them to the dataframe:

import numpy as np

s1 = np.where(df['p'] < df['q'], df['q'], df['p'])
s2 = np.where(df['p'] > df['q'], df['q'], df['p'])
df['p'] = s1
df['q'] = s2
df
Out[1]: 
     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1
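The two np.where() calls above compute a row-wise maximum and minimum of the two columns, so the same result can be written more directly with np.maximum / np.minimum. A minimal sketch (this variant is my addition, not from the answer above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})

# np.maximum/np.minimum take the element-wise max/min of the two columns,
# which is exactly what the pair of np.where() calls does.
p_new = np.maximum(df['p'], df['q'])
q_new = np.minimum(df['p'], df['q'])
df['p'], df['q'] = p_new, q_new
```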

You could also use .where():

s1 = df['p'].where(df['p'] > df['q'], df['q'])
s2 = df['p'].where(df['p'] < df['q'], df['q'])
df['p'] = s1
df['q'] = s2
df

I tested execution times on dataframes ranging from 100 rows to 1 million rows, and the answers that require passing axis=1 can be 10,000 times slower:

  1. Erfan's numpy answer looks to be the fastest, executing in milliseconds even for large datasets.
  2. My .where() answer also keeps the execution time in milliseconds (I assume np.where() would perform similarly).
  3. I thought MHDG7's answer would be the slowest, but it is actually faster than Alexander's answer.
  4. I guess Alexander's answer is slow because it requires passing axis=1. Since both MHDG7's and Alexander's answers operate row-wise (with axis=1), they can slow things down tremendously for large dataframes.

As you can see, a million-row dataframe took minutes to execute. And if you had a 10-million to 100-million row dataframe, these one-liners could take hours.


from timeit import timeit
import numpy as np
import pandas as pd

# the 5-row example frame from the question
df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})

def df_where(df):
    s1 = df['p'].where(df['p'] > df['q'], df['q'])
    s2 = df['p'].where(df['p'] < df['q'], df['q'])
    df['p'] = s1
    df['q'] = s2
    return df


def agg_maxmin(df):
    df[['p', 'q']] = df[['p', 'q']].agg([max, min], axis=1)
    return df


def np_flip(df):
    df = pd.DataFrame(np.flip(np.sort(df), axis=1), columns=df.columns)
    return df


def lambda_x(df):
    df = df.apply(lambda x: [x['p'],x['q']] if x['p']>x['q'] else [x['q'],x['p']],axis=1,result_type='expand')
    return df


res = pd.DataFrame(
    index=[20, 200, 2000, 20000, 200000],
    columns='df_where agg_maxmin np_flip lambda_x'.split(),
    dtype=float
)

for i in res.index:
    d = pd.concat([df]*i)
    for j in res.columns:
        stmt = '{}(d)'.format(j)
        setp = 'from __main__ import d, {}'.format(j)
        print(stmt, d.shape)
        res.at[i, j] = timeit(stmt, setp, number=1)

res.plot(loglog=True);

[Plot: log-log execution time vs. number of rows for df_where, agg_maxmin, np_flip and lambda_x]

Use numpy.sort to sort over the horizontal axis ascending, then flip the arrays over axis=1 :

df = pd.DataFrame(np.flip(np.sort(df), axis=1), columns=df.columns)
     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1
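np.sort converts the dataframe to a plain ndarray and sorts each row ascending along the last axis; flipping over axis=1 then puts the largest value first, so this generalizes to any number of columns. A quick self-contained check (the frame is reconstructed from the question's example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})

# np.sort(df) returns an ndarray sorted ascending within each row;
# np.flip(..., axis=1) reverses each row so the largest value lands in 'p'.
arr = np.flip(np.sort(df), axis=1)
out = pd.DataFrame(arr, columns=df.columns)
```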

Use agg, passing a list of functions (max and min) and axis=1 so that those functions are applied row-wise.

df[['p', 'q']] = df[['p', 'q']].agg([max, min], axis=1)

>>> df
     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1

Simple solutions are not always the most performant (e.g. the one above). The following solution is significantly faster: it builds a boolean mask where column p is less than column q, then swaps the values in those rows.

mask = df['p'].lt(df['q'])
df.loc[mask, ['p', 'q']] = df.loc[mask, ['q', 'p']].to_numpy()
>>> df
     p    q
0  0.5  0.5
1  0.6  0.4
2  0.7  0.3
3  0.6  0.4
4  0.9  0.1
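The .to_numpy() call in the swap is essential: it drops the column labels, so pandas assigns the values positionally. Without it, pandas would re-align the ['q', 'p'] columns back to ['p', 'q'] by label and the assignment would change nothing. A runnable sketch of the full swap (the frame is reconstructed from the question):

```python
import pandas as pd

df = pd.DataFrame({'p': [0.5, 0.6, 0.3, 0.4, 0.9],
                   'q': [0.5, 0.4, 0.7, 0.6, 0.1]})

mask = df['p'].lt(df['q'])
# .to_numpy() strips the labels; the right-hand side becomes a bare array
# assigned positionally, so 'q' values land in 'p' and vice versa.
df.loc[mask, ['p', 'q']] = df.loc[mask, ['q', 'p']].to_numpy()
```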

You can use the apply function:

df[['p','q']] = df.apply(lambda x: [x['p'],x['q']] if x['p']>x['q'] else [x['q'],x['p']],axis=1,result_type='expand' )
