How to count “natural” appearances of certain entities in a string in Python

Question

I have the following string:

sentence = 'aa bb aa, aabb aa.'

I'm looking for a way to count the following entities:

entity_1 = 'aa'
entity_2 = 'aa bb'

The end result should be this:
entity_1 count = 2 (only 'aa,' and 'aa.', the 'aa' in 'aa bb' should not count as 'aa bb' is its own entity)
entity_2 count = 1 (only 'aa bb')

I've tried sentence.split(" ").count(entity) and sentence.count(entity), but both give the wrong counts.
Any ideas?

Answer 1

You could use a pattern with an alternation that matches either aa bb or aa, surrounded by word boundaries \b so that a match cannot be part of a longer word. List the longer alternative aa bb first; otherwise aa would match first and aa bb would never be counted.

To reference the groups in code, you can name them entity_1 and entity_2 instead of using group numbers.

Loop over the results using, for example, re.finditer, and increment the corresponding counter whenever a named group is not None.

Pattern

\b(?:(?P<entity_2>aa bb)|(?P<entity_1>aa))\b

Explanation

\b Word boundary
(?: Non-capturing group with alternatives
- (?P<entity_2>aa bb) Named group entity_2, matches aa bb
- | Or
- (?P<entity_1>aa) Named group entity_1, matches aa
)
\b Word boundary
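
The order of the alternatives matters, because the regex engine tries them left to right and keeps the first one that succeeds. The following is a minimal sketch, using the sentence from the question and illustrative variable names, that compares both orderings without named groups:

```python
import re

sentence = "aa bb aa, aabb aa."

# Shorter alternative first: "aa" wins at position 0, so "aa bb" never matches.
shorter_first = re.findall(r"\b(?:aa|aa bb)\b", sentence)

# Longer alternative first: "aa bb" is tried before "aa".
longer_first = re.findall(r"\b(?:aa bb|aa)\b", sentence)

print(shorter_first)  # ['aa', 'aa', 'aa']
print(longer_first)   # ['aa bb', 'aa', 'aa']
```

In both cases the aa inside aabb is skipped, since there is no word boundary between a and b.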

Example code

import re

# The longer entity comes first so that "aa bb" is tried before "aa".
regex = r"\b(?:(?P<entity_2>aa bb)|(?P<entity_1>aa))\b"
test_str = "aa bb aa, aabb aa."
entity_1_count = 0
entity_2_count = 0

for match in re.finditer(regex, test_str):
    # Exactly one of the two named groups is set for each match.
    if match.group("entity_1") is not None:
        entity_1_count += 1
    elif match.group("entity_2") is not None:
        entity_2_count += 1

print(entity_1_count)  # count of 'aa'
print(entity_2_count)  # count of 'aa bb'

Output

2
1
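
If the entities are not known in advance, the same idea might be generalized by building the pattern from a list. This is only a sketch: count_entities is a hypothetical helper name, and it assumes that every entity starts and ends with a word character (so that \b behaves as expected) and that the entities are distinct.

```python
import re

def count_entities(text, entities):
    """Count non-overlapping whole-word occurrences of each entity (hypothetical helper)."""
    # Try longer entities first so that "aa bb" takes priority over "aa".
    ordered = sorted(entities, key=len, reverse=True)
    # re.escape keeps any regex metacharacters in the entities literal.
    pattern = r"\b(?:" + "|".join(re.escape(e) for e in ordered) + r")\b"
    counts = {entity: 0 for entity in entities}
    for match in re.finditer(pattern, text):
        counts[match.group(0)] += 1
    return counts

result = count_entities("aa bb aa, aabb aa.", ["aa", "aa bb"])
print(result)  # {'aa': 2, 'aa bb': 1}
```

Sorting by length avoids having to order the alternatives by hand, at the cost of matching case-sensitively and ignoring overlapping occurrences.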
