
Python: Count number of rows containing text within range of columns

The answers in "Count number of rows when row contains certain text" got me part way...

Columns are labeled "1a.", "2a.", and "3a." Rows are each labeled with unique identifiers (random alpha-numeric code).

Table

How do you count how many rows contain at least 1 of 10 letters across multiple columns?

This code works for one column: len(df[df['1a.'].str.contains('A|I|M|Q|C|K|G|O|E|S')])

I tried for multiple columns using len(df[df['1a.'|'2a.'|'3a.'].str.contains('A|I|M|Q|C|K|G|O|E|S')]) and get an error:


TypeError Traceback (most recent call last) in ----> 1 len(df[df['1a.'|'2a.'|'3a.'].str.contains('A|I|M|Q|C|K|G|O|E|S')])

TypeError: unsupported operand type(s) for |: 'str' and 'str'

The row should only be counted once, whether the three columns contain "A" and "I" and "M" (all three letters on the list) OR "A" and "B" and "L" (last two letters not on the list).

You can combine the boolean masks of the individual columns with the element-wise operators & and |. Apply | to the masks themselves, not to the filtered DataFrames, for example:

df['1a.'].str.contains('A|I|M|Q|C|K|G|O|E|S') | df['2a.'].str.contains('A|I|M|Q|C|K|G|O|E|S') | df['3a.'].str.contains('A|I|M|Q|C|K|G|O|E|S')

Logical operation on two columns of a dataframe

So the complete answer, counting the rows where the combined mask is True, would be:

(df['1a.'].str.contains('A|I|M|Q|C|K|G|O|E|S') | df['2a.'].str.contains('A|I|M|Q|C|K|G|O|E|S') | df['3a.'].str.contains('A|I|M|Q|C|K|G|O|E|S')).sum()
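The same idea can be written for any list of columns without repeating the pattern. The following is a minimal sketch, not the original poster's code: the sample values and row labels are made up for illustration, and only the column names and the pattern come from the question. It builds one boolean mask per column with str.contains and then uses any(axis=1) so that each row is counted at most once.

```python
import pandas as pd

# Hypothetical sample data: column names follow the question, values are invented.
df = pd.DataFrame(
    {
        "1a.": ["A", "B", "L", "Q"],
        "2a.": ["I", "B", "D", "B"],
        "3a.": ["M", "L", "F", "B"],
    },
    index=["x1", "x2", "x3", "x4"],
)

pattern = "A|I|M|Q|C|K|G|O|E|S"
columns = ["1a.", "2a.", "3a."]

# One boolean column per searched column: True where the cell matches the pattern.
# na=False treats missing values as "no match" instead of propagating NaN.
matches = df[columns].apply(lambda col: col.str.contains(pattern, na=False))

# A row is counted once if at least one of its columns matched.
row_mask = matches.any(axis=1)
num_rows = int(row_mask.sum())
print(num_rows)
```

Summing the mask avoids the KeyError that value_counts()[True] would raise when no row matches at all.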

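By putting the letters you want to search for in a list, search_for_items, you can get what you want in three lines. Note that this checks whether a cell's value equals one of the items exactly, which differs from the substring match that str.contains performs.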

search_for_items = ['A','B','C']
boolean_series = df.apply(lambda x: bool(set(list(x)) & set(search_for_items)), axis=1)
num_of_rows = boolean_series.sum()

Explanation:

1- Put the items you need to search for in a list.

2- Build a boolean series by checking, for each row, whether two sets share at least one item. The first set holds the values in that dataframe row. The second set holds the items you are searching for.

3- Finally, sum the series. Each True counts as 1, so the result is the number of matching rows.

Example:

import pandas as pd

df = pd.DataFrame({ 'a1':['A','B', 'Z','D','E','F','G'],
                    'a2':['A','Q', 'C','D','E','F','G'],
                    'a3':['A','Z', 'Q','D','E','F','G']
                  })
search_for_items = ['A','B','C']
df
    a1  a2  a3
0   A   A   A
1   B   Q   Z
2   Z   C   Q
3   D   D   D
4   E   E   E
5   F   F   F
6   G   G   G

Solution:

boolean_series = df.apply(lambda x: bool(set(list(x)) & set(search_for_items)), axis=1)
num_of_rows = boolean_series.sum()
num_of_rows
3

This returns 3, as expected, because the first three rows of the dataframe contain an A, a B, or a C.
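For exact cell matches like this, a vectorized alternative that avoids the row-wise apply is DataFrame.isin combined with any(axis=1). This is a sketch of a swapped-in technique rather than the answer's own method; it reuses the example dataframe above and should give the same count.

```python
import pandas as pd

df = pd.DataFrame({
    'a1': ['A', 'B', 'Z', 'D', 'E', 'F', 'G'],
    'a2': ['A', 'Q', 'C', 'D', 'E', 'F', 'G'],
    'a3': ['A', 'Z', 'Q', 'D', 'E', 'F', 'G'],
})
search_for_items = ['A', 'B', 'C']

# isin marks every cell whose value equals one of the items;
# any(axis=1) then collapses those marks to one True/False per row.
row_mask = df.isin(search_for_items).any(axis=1)
num_of_rows = int(row_mask.sum())
print(num_of_rows)  # 3
```

Like the set-based version, this compares whole cell values, so it would not find a letter embedded in a longer string.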
