Sort Pandas data frame by rows that have multiple similar values

I'm trying to filter a pandas data frame down to the rows that contain two specific values in any column. In the sample data below, I want to select the rows that contain both 'apple' AND 'grape':

  a     b      c
0 apple orange grape
1 grape apple  banana
2 pear  kiwi   apple

resulting in a filtered data frame that shows:

  a     b      c
0 apple orange grape
1 grape apple  banana

Using the code below, I can select all the rows that have one specific value:

df[(df == 'orange').any(axis=1)]

The result returned, as expected, was:

  a     b      c
0 apple orange grape

Using the following line of code, I expected to select the rows that had both values somewhere in the row, but this returned all the rows that had either apple OR grape as a column value:

df[np.isin(df, ['apple', 'grape']).any(axis=1)]

I expected the previous line to return only the rows that had apple AND grape, but that obviously isn't the correct way to accomplish this. How do I select only the rows that contain both values, in any column?
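For anyone who wants to reproduce this, here is a minimal sketch of the sample frame and the np.isin attempt. The variable names hits and either are my own, not from the original code. It shows why the result behaves like OR: .any(axis=1) is true as soon as a single cell in the row matches either value.

```python
import numpy as np
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'a': ['apple', 'grape', 'pear'],
    'b': ['orange', 'apple', 'kiwi'],
    'c': ['grape', 'banana', 'apple'],
})

# Element-wise membership test: True wherever a cell is 'apple' or 'grape'.
hits = np.isin(df, ['apple', 'grape'])

# .any(axis=1) is True if at least one cell in the row matched,
# so a row containing only 'apple' still passes (OR, not AND).
either = df[hits.any(axis=1)]
print(either)
```

Row 2 (pear, kiwi, apple) is kept because it contains apple, even though it has no grape.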

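One way is to create a boolean mask that counts the matching cells in each row (this assumes neither value appears more than once in a row):

mask = df.isin(['apple', 'grape']).sum(axis=1).eq(2)

Then index with it:

result = df[mask]

Output of result: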

    a       b       c
0   apple   orange  grape
1   grape   apple   banana
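One caveat with counting matches: a row that repeats one value is counted the same as a row that has both. The sketch below uses a hypothetical frame df2 (not the question's data) to show the difference between counting matches and checking each value separately.

```python
import pandas as pd

# Hypothetical frame: row 0 repeats 'apple' and has no 'grape'.
df2 = pd.DataFrame({
    'a': ['apple', 'apple'],
    'b': ['apple', 'grape'],
    'c': ['pear', 'kiwi'],
})
vals = ['apple', 'grape']

# Counting matches: row 0 scores 2 because 'apple' appears twice.
count_mask = df2.isin(vals).sum(axis=1).eq(len(vals))

# Requiring each value separately: row 0 fails because 'grape' is absent.
each_mask = pd.concat(
    [(df2 == v).any(axis=1) for v in vals], axis=1
).all(axis=1)

print(count_mask.tolist())
print(each_mask.tolist())
```

If values can repeat within a row, the per-value check is the safer of the two.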

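With your shown samples, you can build one boolean mask per value with the pandas .any method and combine the masks with &. Pass axis as a keyword, since recent pandas versions no longer accept it positionally for .any:

m1 = (df == 'apple').any(axis=1)
m2 = (df == 'grape').any(axis=1)
df[m1 & m2]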

Output will be as follows:

    a       b       c
0   apple   orange  grape
1   grape   apple   banana
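The same idea extends to any number of values. This is a sketch rather than part of the original answer: it builds one mask per value in a list and ANDs them together with np.logical_and.reduce. The names vals, masks and combined are my own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': ['apple', 'grape', 'pear'],
    'b': ['orange', 'apple', 'kiwi'],
    'c': ['grape', 'banana', 'apple'],
})
vals = ['apple', 'grape']

# One boolean Series per value: does the row contain it anywhere?
masks = [(df == v).any(axis=1) for v in vals]

# AND all the per-value masks together into one boolean array.
combined = np.logical_and.reduce(masks)

result = df[combined]
print(result)
```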

One option is to count the number of True values that np.isin produces in each row, using sum on axis=1, and then check whether that count is greater than or equal to the number of values being looked for:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'a': {0: 'apple', 1: 'grape', 2: 'pear'},
    'b': {0: 'orange', 1: 'apple', 2: 'kiwi'},
    'c': {0: 'grape', 1: 'banana', 2: 'apple'}
})

vals = ['apple', 'grape']

filtered = df[np.isin(df, vals).sum(axis=1) >= len(vals)]

print(filtered)

Another option is to turn the values into a set and apply its issubset method to each row with axis=1:

filtered = df[df.apply(set(vals).issubset, axis=1)]

Both give:

       a       b       c
0  apple  orange   grape
1  grape   apple  banana
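To see what the issubset approach does for a single row, here is a small sketch (row0 and row2 are my own names). With axis=1, apply passes each row to the function as a Series, and issubset accepts any iterable; iterating a Series yields its values.

```python
import pandas as pd

df = pd.DataFrame({
    'a': ['apple', 'grape', 'pear'],
    'b': ['orange', 'apple', 'kiwi'],
    'c': ['grape', 'banana', 'apple'],
})
vals = {'apple', 'grape'}

# Row 0 holds apple, orange, grape: both wanted values are present.
row0 = vals.issubset(df.loc[0])

# Row 2 holds pear, kiwi, apple: grape is missing.
row2 = vals.issubset(df.loc[2])

print(row0, row2)
```

Because this is a set test, repeated values in a row do not affect the result.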
