
Select column of pandas dataframe based on value

I have a fairly simple issue that I can't get solved:

I have a dataframe which looks like the following:

               DB         GOOG        ATVI         TSLA        TWTR
1.0    -0.0129755  -0.00463688 -0.00408223    0.0653678   0.0779629
2.0     -0.052772  -0.00712359  -0.0120323    0.0230537   0.0245435
3.0    0.00875274   0.00762426  0.00176186    0.0834672  -0.0017326
4.0    -0.0125196   0.00657628 -0.00235884    0.0502074   0.0157572
5.0    -0.0470443  -0.00382168  -0.0153009   -0.0325997  -0.0235564
6.0     0.0140261  -0.00630647 -0.00265291    -0.037598  -0.0454938
7.0   0.000624415  -0.00429897 -0.00088587 -9.73558e-05  -0.0216945
8.0    -0.0138933  -0.00455289   -0.027357  -0.00682128  -0.0186916
9.0   -0.00311624 -0.000168211  -0.0100577  -0.00894537 -0.00181214
10.0    0.0864933    0.0151531     0.05061    0.0294589   0.0395802

For every row I want to find the column name corresponding to the largest, second largest and smallest value.

Example:

For the first row (index 1.0) I want to extract TWTR (largest value), TSLA (second largest value) and DB (smallest value).

...

For the fifth row (index 5.0) I want to extract GOOG (largest value), ATVI (second largest value) and DB (smallest value).

... and so on.

What is the simplest way to do this? Thanks!
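For reference, here is a minimal sketch that reconstructs the sample dataframe above; the values and the float index are copied directly from the table:

import pandas as pd

df = pd.DataFrame(
    {'DB':   [-0.0129755, -0.052772, 0.00875274, -0.0125196, -0.0470443,
              0.0140261, 0.000624415, -0.0138933, -0.00311624, 0.0864933],
     'GOOG': [-0.00463688, -0.00712359, 0.00762426, 0.00657628, -0.00382168,
              -0.00630647, -0.00429897, -0.00455289, -0.000168211, 0.0151531],
     'ATVI': [-0.00408223, -0.0120323, 0.00176186, -0.00235884, -0.0153009,
              -0.00265291, -0.00088587, -0.027357, -0.0100577, 0.05061],
     'TSLA': [0.0653678, 0.0230537, 0.0834672, 0.0502074, -0.0325997,
              -0.037598, -9.73558e-05, -0.00682128, -0.00894537, 0.0294589],
     'TWTR': [0.0779629, 0.0245435, -0.0017326, 0.0157572, -0.0235564,
              -0.0454938, -0.0216945, -0.0186916, -0.00181214, 0.0395802]},
    index=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0])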

You can use the very fast numpy.argsort to get the column names ordered by value within each row:

print (np.argsort(-df.values, axis=1))
[[4 3 2 1 0]
 [4 3 1 2 0]
 [3 0 1 2 4]
 [3 4 1 2 0]
 [1 2 4 3 0]
 [0 2 1 3 4]
 [0 3 2 1 4]
 [1 3 0 4 2]
 [1 4 0 3 2]
 [0 2 4 3 1]]

print (df.columns[np.argsort(-df.values, axis=1)])
Index([['TWTR', 'TSLA', 'ATVI', 'GOOG', 'DB'],
       ['TWTR', 'TSLA', 'GOOG', 'ATVI', 'DB'],
       ['TSLA', 'DB', 'GOOG', 'ATVI', 'TWTR'],
       ['TSLA', 'TWTR', 'GOOG', 'ATVI', 'DB'],
       ['GOOG', 'ATVI', 'TWTR', 'TSLA', 'DB'],
       ['DB', 'ATVI', 'GOOG', 'TSLA', 'TWTR'],
       ['DB', 'TSLA', 'ATVI', 'GOOG', 'TWTR'],
       ['GOOG', 'TSLA', 'DB', 'TWTR', 'ATVI'],
       ['GOOG', 'TWTR', 'DB', 'TSLA', 'ATVI'],
       ['DB', 'ATVI', 'TWTR', 'TSLA', 'GOOG']],
      dtype='object')

df = pd.DataFrame(df.columns[np.argsort(-df.values, axis=1)][:,[0,1,-1]], index=df.index)
df.columns = ['largest','second largest','smallest']
print (df)
     largest second largest smallest
1.0     TWTR           TSLA       DB
2.0     TWTR           TSLA       DB
3.0     TSLA             DB     TWTR
4.0     TSLA           TWTR       DB
5.0     GOOG           ATVI       DB
6.0       DB           ATVI     TWTR
7.0       DB           TSLA     TWTR
8.0     GOOG           TSLA     ATVI
9.0     GOOG           TWTR     ATVI
10.0      DB           ATVI     GOOG
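
The slice [0,1,-1] simply picks ranks from the descending order (position 0 = largest, 1 = second largest, -1 = smallest), so the same idea generalizes to any rank you need. A small sketch, assuming numpy is imported as np and df still holds the original numeric data (before it was overwritten above); the column labels are converted to a plain NumPy array so the 2-D indexing works on newer pandas versions as well:

# rank 0 = largest, rank 1 = second largest, rank -1 = smallest, rank -2 = second smallest, ...
cols = np.array(df.columns)
order = cols[np.argsort(-df.values, axis=1)]
res = pd.DataFrame(order[:, [0, 1, 2, -1]], index=df.index,
                   columns=['largest', 'second largest', 'third largest', 'smallest'])
print(res)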

Other (slower) solutions use apply with a custom function or lambda that sorts each row and extracts the index labels:

def f(x):
    # sort the row ascending: last label is the largest, first label is the smallest
    x = x.sort_values()
    return pd.Series([x.index[-1], x.index[-2], x.index[0]],
                     index=['largest','second largest','smallest'])

df = df.apply(f, axis=1)
print (df)
     largest second largest smallest
1.0     TWTR           TSLA       DB
2.0     TWTR           TSLA       DB
3.0     TSLA             DB     TWTR
4.0     TSLA           TWTR       DB
5.0     GOOG           ATVI       DB
6.0       DB           ATVI     TWTR
7.0       DB           TSLA     TWTR
8.0     GOOG           TSLA     ATVI
9.0     GOOG           TWTR     ATVI
10.0      DB           ATVI     GOOG

df = df.apply(lambda x: x.sort_values().index, axis=1)
df = df.iloc[:, [-1,-2,0]]
df.columns = ['largest','second largest','smallest']
print (df)
     largest second largest smallest
1.0     TWTR           TSLA       DB
2.0     TWTR           TSLA       DB
3.0     TSLA             DB     TWTR
4.0     TSLA           TWTR       DB
5.0     GOOG           ATVI       DB
6.0       DB           ATVI     TWTR
7.0       DB           TSLA     TWTR
8.0     GOOG           TSLA     ATVI
9.0     GOOG           TWTR     ATVI
10.0      DB           ATVI     GOOG
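
A related apply-based variant, not in the original answer, is a sketch that uses Series.nlargest / Series.nsmallest instead of a full sort (again assuming df holds the original numeric data):

def g(x):
    top = x.nlargest(2).index      # labels of the largest and second largest values
    low = x.nsmallest(1).index     # label of the smallest value
    return pd.Series([top[0], top[1], low[0]],
                     index=['largest','second largest','smallest'])

res = df.apply(g, axis=1)
print(res)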

Timings:

#[10000 rows x 5 columns]
df = pd.concat([df]*1000).reset_index(drop=True)

In [357]: %timeit pd.DataFrame(df.columns[np.argsort(-df.values, axis=1)][:,[0,1,-1]], index=df.index, columns=['largest','second largest','smallest'])
1000 loops, best of 3: 974 µs per loop

In [358]: %timeit df.apply(f ,axis=1)
1 loop, best of 3: 3.91 s per loop

In [361]: %timeit df.apply(lambda x: x.sort_values().index ,axis=1).iloc[:, [-1,-2,0]]
1 loop, best of 3: 1.88 s per loop

In [362]: %timeit df.apply(lambda x: x.sort_values().index.to_series().iloc[0], axis=1)
1 loop, best of 3: 2.47 s per loop

In [363]: %timeit df.apply(lambda x: x.sort_values(ascending=False).index.to_series().iloc[0], axis=1)
1 loop, best of 3: 2.51 s per loop

In [364]: %timeit df.apply(lambda x: x.sort_values(ascending=False).index.to_series().iloc[1], axis=1)
1 loop, best of 3: 2.52 s per loop

In [365]: %timeit [df.T.sort_values(by=k).T.keys()[-1] for k in df.T.keys()]
1 loop, best of 3: 6.42 s per loop

Not sure if it is the simplest solution, but it works:

df = pd.DataFrame(np.random.rand(5,5), columns=['a','b','c','d','e'])
largest = [df.T.sort_values(by=k).T.keys()[-1] for k in df.T.keys()]
largest2 = [df.T.sort_values(by=k).T.keys()[-2] for k in df.T.keys()]
smallest = [df.T.sort_values(by=k).T.keys()[0] for k in df.T.keys()]
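
If you want the three results side by side, a small sketch that combines the lists above into one dataframe:

result = pd.DataFrame({'largest': largest,
                       'second largest': largest2,
                       'smallest': smallest},
                      index=df.index)
print(result)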

You can use apply along the row axis and then sort_values to order the data. Then you use integer-based indexing on the sorted series' index to get what you want:

# smallest value
df.apply(lambda x: x.sort_values().index[0], axis=1)
# biggest value
df.apply(lambda x: x.sort_values(ascending=False).index[0], axis=1)
# 2nd biggest value
df.apply(lambda x: x.sort_values(ascending=False).index[1], axis=1)

Applied to your dataframe, the output looks like this:

In [38]: df.apply(lambda x: x.sort_values().index.to_series().iloc[0], axis=1)
Out[38]: 
1.0       DB
2.0       DB
3.0     TWTR
4.0       DB
5.0       DB
6.0     TWTR
7.0     TWTR
8.0     ATVI
9.0     ATVI
10.0    GOOG
dtype: object

In [39]: df.apply(lambda x: x.sort_values(ascending=False).index.to_series().iloc[0], axis=1)
Out[39]: 
1.0     TWTR
2.0     TWTR
3.0     TSLA
4.0     TSLA
5.0     GOOG
6.0       DB
7.0       DB
8.0     GOOG
9.0     GOOG
10.0      DB
dtype: object

In [40]: df.apply(lambda x: x.sort_values(ascending=False).index.to_series().iloc[1], axis=1)
Out[40]: 
1.0     TSLA
2.0     TSLA
3.0       DB
4.0     TWTR
5.0     ATVI
6.0     ATVI
7.0     TSLA
8.0     TSLA
9.0     TWTR
10.0    ATVI
dtype: object
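
As a side note (not part of the answer above): if you only need the single largest and smallest column per row, DataFrame.idxmax and DataFrame.idxmin along axis=1 avoid the per-row sort entirely; they do not cover the second largest, though.

# largest column label per row
print(df.idxmax(axis=1))
# smallest column label per row
print(df.idxmin(axis=1))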
