
Python: Count number of rows containing text within range of columns

The answers in "Count number of rows when row contains certain text" got me part way...

Columns are labeled "1a.", "2a.", and "3a.". Each row is labeled with a unique identifier (a random alphanumeric code).

Table

How do you count how many rows contain at least 1 of 10 letters across multiple columns?

This code works for one column: len(df[df['1a.'].str.contains('A|I|M|Q|C|K|G|O|E|S')])

I tried multiple columns using len(df[df['1a.'|'2a.'|'3a.'].str.contains('A|I|M|Q|C|K|G|O|E|S')]) and got an error:


TypeError                                 Traceback (most recent call last)
----> 1 len(df[df['1a.'|'2a.'|'3a.'].str.contains('A|I|M|Q|C|K|G|O|E|S')])

TypeError: unsupported operand type(s) for |: 'str' and 'str'

The row should only be counted once, whether the three columns contain "A" and "I" and "M" (all three letters on the list) or "A" and "B" and "L" (last two letters not on the list).

You can combine the per-column conditions with the logical operators & and |. Apply them to the boolean results of str.contains, not to the column names or to filtered dataframes. For example:

df['1a.'].str.contains('A|I|M|Q|C|K|G|O|E|S') | df['2a.'].str.contains('A|I|M|Q|C|K|G|O|E|S') | df['3a.'].str.contains('A|I|M|Q|C|K|G|O|E|S')

See also: Logical operation on two columns of a dataframe

So the complete answer would be:

(df['1a.'].str.contains('A|I|M|Q|C|K|G|O|E|S') | df['2a.'].str.contains('A|I|M|Q|C|K|G|O|E|S') | df['3a.'].str.contains('A|I|M|Q|C|K|G|O|E|S')).value_counts()[True]
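
A minimal runnable sketch of this approach, assuming a small made-up frame with the question's column names. The data and the names pattern, cols, mask, and mask_any are illustrative and not from the original post. It builds one boolean Series per column, combines them with |, and counts the True values with .sum(), which should also give 0 when no row matches.

```python
import pandas as pd

# Illustrative data only; the table from the question was not preserved.
df = pd.DataFrame(
    {
        "1a.": ["A", "B", "L", "I"],
        "2a.": ["I", "B", "D", "B"],
        "3a.": ["M", "L", "F", "L"],
    },
    index=["x1", "x2", "x3", "x4"],
)

pattern = "A|I|M|Q|C|K|G|O|E|S"
cols = ["1a.", "2a.", "3a."]

# One boolean Series per column, OR-ed together element-wise,
# so each row contributes at most one True.
mask = (
    df["1a."].str.contains(pattern)
    | df["2a."].str.contains(pattern)
    | df["3a."].str.contains(pattern)
)

# Same idea for any number of columns: test each column,
# then check whether any column in the row matched.
mask_any = df[cols].apply(lambda col: col.str.contains(pattern)).any(axis=1)

count = int(mask.sum())
print(count)
```

The second form avoids repeating the pattern for every column, which may be more convenient when the column range is long.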

By putting the letters you want to search for in a list, search_for_items, you can get what you want in two lines:

search_for_items = ['A','B','C']
boolean_series = df.apply(lambda x: bool(set(list(x)) & set(search_for_items)), axis=1)
num_of_rows = boolean_series.sum()

Explanation:

1- Put the items you need to search for in a list.

2- Get a boolean series by checking whether two sets share at least one item. The first set holds the values in a dataframe row. The second set holds the items you are searching for.

3- Finally, apply sum to this series; each True counts as 1, so the result is the number of matching rows.
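
Step 2 can be sketched for a single row in plain Python. The row values below are made up for illustration. Note that this test checks whether a cell is exactly equal to one of the items, not whether it contains one as a substring.

```python
search_for_items = ['A', 'B', 'C']

row_hit = ['B', 'Q', 'Z']    # shares 'B' with the search list
row_miss = ['D', 'D', 'D']   # shares nothing with the search list

# A non-empty intersection is truthy, an empty one is falsy.
hit = bool(set(row_hit) & set(search_for_items))
miss = bool(set(row_miss) & set(search_for_items))
print(hit, miss)
```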

Example:

import pandas as pd

df = pd.DataFrame({ 'a1':['A','B', 'Z','D','E','F','G'],
                    'a2':['A','Q', 'C','D','E','F','G'],
                    'a3':['A','Z', 'Q','D','E','F','G']
                  })
search_for_items = ['A','B','C']
df
    a1  a2  a3
0   A   A   A
1   B   Q   Z
2   Z   C   Q
3   D   D   D
4   E   E   E
5   F   F   F
6   G   G   G

Solution:

boolean_series = df.apply(lambda x: bool(set(list(x)) & set(search_for_items)), axis=1)
num_of_rows = boolean_series.sum()
num_of_rows
3

This returns 3 as expected, because the first three rows in the dataframe contain an A, a B, or a C.
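
As an alternative sketch (not part of the original answer), the same exact-match count can likely be obtained without a row-wise apply, using DataFrame.isin followed by any(axis=1). The example reuses the frame from above.

```python
import pandas as pd

df = pd.DataFrame({'a1': ['A', 'B', 'Z', 'D', 'E', 'F', 'G'],
                   'a2': ['A', 'Q', 'C', 'D', 'E', 'F', 'G'],
                   'a3': ['A', 'Z', 'Q', 'D', 'E', 'F', 'G']})
search_for_items = ['A', 'B', 'C']

# isin marks every cell that equals one of the items;
# any(axis=1) collapses each row to a single True/False.
row_matches = df.isin(search_for_items).any(axis=1)
num_of_rows = int(row_matches.sum())
print(num_of_rows)
```

Like the set-based version, this matches whole cell values; for substring matching, str.contains on each column is the closer fit.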
