
Find rows where one of 2 cols contains any value from a third split column in Pandas

Given a pandas DataFrame like this:

A           B           C
-------------------------
A. b.       b. d.       a
c.          c.          d
a. k.       b.          b
a.          b.          a, B

Code:

import pandas as pd

df = pd.DataFrame({
    'A': ['A. b.', 'c.', 'a. k.', 'a.'],
    'B': ['b. d.', 'c.', 'b.', 'b.'],
    'C': ['a', 'd', 'b', 'a, B']
})

I want to select all rows where A or B contain any value from C. Here, the result would be:

A           B           C
-------------------------
A. b.       b. d.       a
a. k.       b.          b
a.          b.          a, B

All cells contain values in a simple delimited format, so they can be parsed with split.

I've tried:

df[df['A'].str.contains(df['C'].split(','))]

But no success so far.

Assuming from your sample output that your comparisons are case-insensitive:

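mask = pd.DataFrame({
    # set of lowercase characters found in A and B combined
    'AB': (df.A + ' ' + df.B).str.lower().map(set),
    # set of lowercase values from C, with surrounding spaces trimmed
    'C': df.C.str.lower().str.split(',').map(lambda parts: {p.strip() for p in parts})
}).apply(lambda row: bool(row['AB'].intersection(row['C'])), axis=1)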

df[mask].reset_index(drop=True)

Description:

Concatenate columns A and B and convert each cell to the set of characters it contains. Split column C on its delimiter into a set as well, then check whether the intersection of the two sets is non-empty. The resulting boolean Series serves as a mask on the original DataFrame. Because AB is a set of single characters, this only works when every value in C is a single character, as in the sample data.
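If the values can be longer than one character, comparing whole tokens instead of characters is safer. Here is a minimal sketch, assuming (as in the sample data) that values in A and B are separated by whitespace and end with a period, and that values in C are comma-separated. The helper names tokens_ab and tokens_c are made up for this illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['A. b.', 'c.', 'a. k.', 'a.'],
    'B': ['b. d.', 'c.', 'b.', 'b.'],
    'C': ['a', 'd', 'b', 'a, B']
})

def tokens_ab(cell):
    # 'A. b.' -> {'a', 'b'}: split on whitespace, drop the trailing period, lowercase
    return {tok.rstrip('.').lower() for tok in cell.split()}

def tokens_c(cell):
    # 'a, B' -> {'a', 'b'}: split on commas, trim spaces, lowercase
    return {tok.strip().lower() for tok in cell.split(',')}

ab = (df['A'] + ' ' + df['B']).map(tokens_ab)
c = df['C'].map(tokens_c)

# True where the row's A/B tokens share at least one value with its C tokens
mask = pd.Series([bool(x & y) for x, y in zip(ab, c)], index=df.index)
result = df[mask].reset_index(drop=True)
print(result)
```

On the sample frame this keeps rows 0, 2 and 3, the same rows as the expected result in the question.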


Timings:

def f():
    mask = pd.DataFrame({
        'AB': (df.A + ' ' + df.B).str.lower().map(set),
        'C': df.C.str.split(',').map(set)
    }).apply(lambda row: bool(row['AB'].intersection(row['C'])), axis=1)

    return df[mask].reset_index(drop=True)

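%timeit f()

Output, as originally posted:

27.4 ns ± 0.483 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Note that this figure was obtained with %timeit f, which only evaluates the name f without calling it, so it does not measure the filtering itself. Run %timeit f() to time the actual work.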
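%timeit is an IPython magic. Outside a notebook, the standard-library timeit module does the same job. A rough sketch, assuming the sample frame from the question; the function name select_rows and the loop count of 100 are arbitrary choices for illustration. The function object is passed to timeit.timeit, which invokes it on every loop:

```python
import timeit
import pandas as pd

df = pd.DataFrame({
    'A': ['A. b.', 'c.', 'a. k.', 'a.'],
    'B': ['b. d.', 'c.', 'b.', 'b.'],
    'C': ['a', 'd', 'b', 'a, B']
})

def select_rows():
    # character sets for A+B, trimmed lowercase values for C
    ab = (df['A'] + ' ' + df['B']).str.lower().map(set)
    c = df['C'].str.lower().str.split(',').map(lambda parts: {p.strip() for p in parts})
    mask = pd.Series([bool(x & y) for x, y in zip(ab, c)], index=df.index)
    return df[mask].reset_index(drop=True)

loops = 100
# timeit.timeit calls select_rows() `loops` times and returns the total in seconds
total = timeit.timeit(select_rows, number=loops)
print(f"{total / loops * 1e6:.1f} us per call")
```

The number printed depends on the machine and the pandas version, so no figure is quoted here.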

"I want to select all rows where A or B contain any value from C"

You can use .isin() here. Here is a small example:

df1 = pd.DataFrame([[1,4,1], [2,5,4], [3,6,7]], columns=['a','b','c'])
df1.loc[df1['a'].isin(df1['c'])]

#output:

    a   b   c
0   1   4   1

This returns row 0 because its value in column a, 1, appears in column c. Note that isin checks each value against the whole of column c, not only against the value in the same row.
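To make the column-wise behaviour of isin concrete, here is a small sketch on the same df1 that contrasts it with a same-row comparison. The variable names anywhere and same_row are only for illustration:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 4, 1], [2, 5, 4], [3, 6, 7]], columns=['a', 'b', 'c'])

# Column-wise: is each value of 'b' found anywhere in column 'c'?
anywhere = df1['b'].isin(df1['c'])

# Row-wise: does 'b' equal 'c' within the same row?
same_row = df1['b'] == df1['c']

print(df1[anywhere])   # row 0 is kept: its b value 4 matches the c value of row 1
print(df1[same_row])   # empty: no row has b equal to its own c
```

For the question as asked, where each row's A and B must be checked against that same row's C, isin alone is therefore not enough.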
