Given a Pandas Dataframe like this:
A       B       C
-------------------------
A. b.   b. d.   a
c.      c.      d
a. k.   b.      b
a.      b.      a, B
Code:
import pandas as pd

df = pd.DataFrame({
    'A': ['A. b.', 'c.', 'a. k.', 'a.'],
    'B': ['b. d.', 'c.', 'b.', 'b.'],
    'C': ['a', 'd', 'b', 'a, B']
})
I want to select all rows where A or B contain any value from C. Here, the result would be:
A       B       C
-------------------------
A. b.   b. d.   a
a. k.   b.      b
a.      b.      a, B
All cells contain values in a simple delimited format (can use split).
I've tried:
df[df['A'].str.contains(df['C'].split(','))]
But no success so far.
Assuming from your sample output that your comparisons are case-insensitive:
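mask = pd.DataFrame({
    # Characters appearing in A or B, lowercased for case-insensitive matching
    'AB': (df.A + ' ' + df.B).str.lower().map(set),
    # Values listed in C, lowercased, with spaces removed before splitting
    'C': df.C.str.lower().str.replace(' ', '').str.split(',').map(set)
}).apply(lambda row: bool(row['AB'].intersection(row['C'])), axis=1)
df[mask].reset_index(drop=True)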
Description:
Combine columns A and B and convert each cell value to the set of characters it contains. Split the delimited values in column C into a set as well, then check whether the intersection of AB and C is non-empty. Use the resulting boolean Series as a mask on your original DataFrame. Because AB is a set of single characters, this only matches values in C that are one character long.
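If the values can be longer than one character, the same intersection idea can be applied to whole tokens instead of characters. This is only a sketch, assuming the values are separated by '.', ',' or whitespace and that matching is case-insensitive; the helper name tokens is hypothetical and not part of the answer above:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'A': ['A. b.', 'c.', 'a. k.', 'a.'],
    'B': ['b. d.', 'c.', 'b.', 'b.'],
    'C': ['a', 'd', 'b', 'a, B'],
})

def tokens(cell):
    # Hypothetical helper: lowercase the cell, split on '.', ',' or
    # whitespace, and drop the empty strings the split leaves behind.
    return {t for t in re.split(r'[.,\s]+', cell.lower()) if t}

# Tokens found in A or B for each row, and tokens listed in C.
ab = (df['A'] + ' ' + df['B']).map(tokens)
c = df['C'].map(tokens)

# A row is kept when the two sets share at least one token.
mask = pd.Series([bool(x & y) for x, y in zip(ab, c)], index=df.index)
result = df[mask].reset_index(drop=True)
print(result)
```

On the sample data this keeps the same three rows as the expected result in the question, and it would also handle values such as 'ab' or 'foo' that a character set cannot distinguish.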
Timings:
def f():
    mask = pd.DataFrame({
        'AB': (df.A + ' ' + df.B).str.lower().map(set),
        'C': df.C.str.split(',').map(set)
    }).apply(lambda row: bool(row['AB'].intersection(row['C'])), axis=1)
    df[mask].reset_index(drop=True)
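%timeit f()
Output (as originally reported for %timeit f, which times only the lookup of the name f rather than a call to it, so the figure does not reflect the cost of building the mask):
27.4 ns ± 0.483 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)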
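Outside IPython, the call can be timed with the standard-library timeit module. A minimal sketch, assuming 100 repetitions are enough for a rough figure (the repetition count is an arbitrary choice, and no timing is quoted here because it depends on the machine):

```python
import timeit
import pandas as pd

df = pd.DataFrame({
    'A': ['A. b.', 'c.', 'a. k.', 'a.'],
    'B': ['b. d.', 'c.', 'b.', 'b.'],
    'C': ['a', 'd', 'b', 'a, B'],
})

def f():
    mask = pd.DataFrame({
        'AB': (df.A + ' ' + df.B).str.lower().map(set),
        'C': df.C.str.split(',').map(set)
    }).apply(lambda row: bool(row['AB'].intersection(row['C'])), axis=1)
    return df[mask].reset_index(drop=True)

# Passing the function object lets timeit call it `number` times,
# so the measurement covers the work done inside f.
number = 100
elapsed = timeit.timeit(f, number=number)
per_call = elapsed / number
print(f"{per_call * 1e6:.1f} us per call")
```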
I want to select all rows where A or B contain any value from C
You can use .isin() here. Here is a small example:
df1 = pd.DataFrame([[1, 4, 1], [2, 5, 4], [3, 6, 7]], columns=['a', 'b', 'c'])
df1.loc[df1['a'].isin(df1['c'])]
# output:
   a  b  c
0  1  4  1
This returns row 0 because its value in column a, 1, also appears in column c.
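To cover the "A or B" part of the question, the two membership checks can be combined with |. A small sketch on the same toy frame; note that isin compares each whole cell against every value in column c, not only the value in the same row, so delimited strings like those in the question would need to be split first:

```python
import pandas as pd

df1 = pd.DataFrame([[1, 4, 1], [2, 5, 4], [3, 6, 7]], columns=['a', 'b', 'c'])

# Each check is column-wide: a cell matches if its value appears
# anywhere in column c.
in_a = df1['a'].isin(df1['c'])  # 1 appears in c
in_b = df1['b'].isin(df1['c'])  # 4 appears in c (in row 1)

# Keep rows where either column has a match.
selected = df1.loc[in_a | in_b]
print(selected)
```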