
Count across dataframe columns based on str.contains (or similar)

I would like to count the number of cells within each row that contain a particular character string. Cells that contain the string more than once should be counted only once.

I can count the number of cells across a row that equal a given value, but when I extend this logic to use str.contains, I run into problems, as shown below.


import pandas as pd

d = {'col1': ["a#", "b", "c#"], 'col2': ["a", "b", "c#"]}
df = pd.DataFrame(d)

# can correctly count across rows using equality
thisworks = (df == "a#").sum(axis=1)

# can count down a single column using str.contains
thisworks1 = df['col1'].str.contains('#').sum()

# but str.contains cannot be used on a DataFrame (this raises AttributeError), so what is the alternative?
thisdoesnt = (df.str.contains('#')).sum(axis=1)

Output should be a series showing the number of cells in each row that contain the given character string.

str.contains is a Series method, so a DataFrame has no .str accessor. To apply it to a whole DataFrame, use agg or apply to run it on each column, for example:

df.agg(lambda x: x.str.contains('#')).sum(1)

Out[2358]:
0    1
1    0
2    2
dtype: int64
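
Because str.contains returns a single boolean per cell, a cell holding the string several times is still counted once, which is what the question asks for. The sketch below is one possible way to wrap the column-wise approach so that it also tolerates missing values and matches the string literally rather than as a regular expression. The helper name count_cells_containing and the sample data are assumptions made for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd

def count_cells_containing(frame, needle):
    """Count, per row, the cells whose text contains `needle` at least once."""
    # regex=False matches `needle` literally; na=False treats missing cells as non-matches.
    mask = frame.apply(lambda col: col.str.contains(needle, regex=False, na=False))
    # Summing booleans across each row gives the number of matching cells.
    return mask.sum(axis=1)

# Hypothetical data: one cell repeats '#', one cell is missing.
df = pd.DataFrame({'col1': ["a#", "b", "c##"], 'col2': ["a", np.nan, "c#"]})
counts = count_cells_containing(df, '#')
print(counts)
```

Here the cell "c##" contributes 1 to its row, not 2, and the missing cell contributes 0.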

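If you would rather avoid agg and apply, you can use np.char.find to work directly on the DataFrame's underlying values. It returns -1 where the substring is absent, so adding 1 and casting to bool marks the matching cells: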

(np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1)

Out[2360]: array([1, 0, 2])

To get a Series instead (or to assign the result to a column of df), pass the array to pd.Series with the DataFrame's index:

pd.Series((np.char.find(df.values.tolist(), '#') + 1).astype(bool).sum(1), index=df.index)

Out[2361]:
0    1
1    0
2    2
dtype: int32
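
The `+ 1` followed by `.astype(bool)` in the expressions above relies on np.char.find returning the index of the first match, or -1 when there is none. A minimal sketch of the same idea, written step by step with an explicit comparison instead of the `+ 1` trick; the variable names are illustrative only, and the input is assumed to contain strings only (no missing values):

```python
import numpy as np

# Same values as the question's DataFrame, as a plain 2-D string array.
arr = np.array([["a#", "a"], ["b", "b"], ["c#", "c#"]])

# Index of the first '#' in each cell, or -1 where '#' does not occur.
positions = np.char.find(arr, '#')

# A cell matches when a position was found, however many times '#' occurs.
contains = positions >= 0

# Count matching cells per row.
row_counts = contains.sum(axis=1)
print(row_counts)
```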

Something like this should work:

df = pd.DataFrame({'col1': ['#', '0'], 'col2': ['#', '#']})
df['totals'] = df['col1'].str.contains('#', regex=False).astype(int) +\
               df['col2'].str.contains('#', regex=False).astype(int)
df
#   col1 col2  totals
# 0    #    #       2
# 1    0    #       1

It should generalize to as many columns as you want.
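
One way to generalize without writing out each column by hand is to sum the per-column results in a generator expression. This is only a sketch; the third column and the list name cols are added here for illustration and do not appear in the original answer.

```python
import pandas as pd

# Hypothetical frame with a third column added for illustration.
df = pd.DataFrame({'col1': ['#', '0'], 'col2': ['#', '#'], 'col3': ['x#', 'y']})

# Columns to inspect; adjust this list as needed.
cols = ['col1', 'col2', 'col3']

# Each term is a 0/1 Series; the built-in sum adds them element-wise.
totals = sum(df[c].str.contains('#', regex=False).astype(int) for c in cols)
print(totals)
```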

A solution using df.apply:

df = pd.DataFrame({'col1': ["a#", "b","c#"], 
                   'col2': ["a", "b","c#"]})
df
  col1 col2
0   a#    a
1    b    b
2   c#   c#

df['sum'] = df.apply(lambda x: x.str.contains('#'), axis=1).sum(axis=1)

  col1 col2  sum
0   a#    a    1
1    b    b    0
2   c#   c#    2
