简体   繁体   中英

Exclude certain values when checking for a condition with pandas

I have a following dataframe.(Actual columns in the dataframe are more than 30)

ID  col1    col2    col3   col4
 1  a##      b##    HO      HO
 2  c##      d23   c##      HO
 3  a##      k##    z##     s#
 4  c##      b12    a12     c12
 5   b##     HO     z##      HO

I want to make a new dataframe filtering rows that have all the strings ending with '##' and if one of the cells in the row contains 'HO',I want to skip it and return a row if the rest of the columns contain string ending with ##

In the above case new dataframe would be like this:

ID  col1    col2    col3  col4
1   a##      b##    HO    HO
3   a##      k##    z##   s#
5   b##      HO     z##   HO

currently I am doing

m = np.logical_and.reduce([
[x.endswith('##')  for x in df[c] ] for c in df.columns if x!='HO'] )

But df[m] is only giving one following row and not the other two

3   a##      k##    z##

How can I fix this?

You can combine Boolean filters as below. Note are only 2 valid scenarios: either you have 2 strings ending in '##' and one equal to 'HO' , or you have all 3 strings ending in '##' .

suffix = df.iloc[:, 1:].apply(lambda s: s.str[-2:].eq('##')).sum(1)
value = df.iloc[:, 1:].eq('HO').sum(1)

res = df[(suffix.eq(2) & value.eq(1)) | suffix.eq(3)]

print(res)

   ID col1 col2 col3
0   1  a##  b##   HO
2   3  a##  k##  z##
4   5  b##   HO  z##

You can replace "HO" with a dummy string '##', store this result as a temporary DataFrame, compute the mask on the temp, and use that to index back into df .

# df = df.set_index('ID')  # Run this if not already done.
df

  col1 col2 col3
ID               
1   a##  b##   HO
2   c##  d23  c##
3   a##  k##  z##
4   c##  b12  a12
5   b##   HO  z##

u = df.replace('HO', '##')
m = np.logical_and.reduce([
   [x.endswith('##')  for x in u[c]] for c in u.columns]
)
df[m]

   col1 col2 col3
ID               
1   a##  b##   HO
3   a##  k##  z##
5   b##   HO  z##

You can try this:

import pandas as pd
import numpy as np

data = {'col1':['a##', 'c##', 'a##', 'c##', 'b##'], 'col2':['b##', 'd23', 'k##', 'b12', 'HO'], 'col3': ['HO', 'c##', 'z##', 'a12', 'z##']}

df = pd.DataFrame(data = data)

m = np.logical_and.reduce([[('##' in x) | (x=='HO') for x in df[c] ]for c in df.columns  ])

print(df, '\n\n',df[m])


 col1 col2 col3
0  a##  b##   HO
1  c##  d23  c##
2  a##  k##  z##
3  c##  b12  a12
4  b##   HO  z## 

   col1 col2 col3
0  a##  b##   HO
2  a##  k##  z##
4  b##   HO  z##

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM