How to count “natural” appearances of certain entities in a string in Python

I have the following string:

sentence = 'aa bb aa, aabb aa.'

I'm looking for a way to count the following entities:

entity_1 = 'aa'
entity_2 = 'aa bb'

The end result should be:
entity_1 count = 2 (only 'aa,' and 'aa.', the 'aa' in 'aa bb' should not count as 'aa bb' is its own entity)
entity_2 count = 1 (only 'aa bb')

I've tried using sentence.split(" ").count(entity) and sentence.count(entity), but both give wrong counts.
Any ideas?
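
For reference, a small sketch of what the two attempts return on the sample sentence. Splitting on spaces leaves punctuation attached to the tokens and can never produce a two-word entity, while str.count matches raw substrings, including the 'aa' inside 'aabb' and inside 'aa bb'.

```python
sentence = 'aa bb aa, aabb aa.'

# Splitting on spaces keeps punctuation glued to the tokens
tokens = sentence.split(" ")
print(tokens)                    # ['aa', 'bb', 'aa,', 'aabb', 'aa.']
print(tokens.count('aa'))        # 1
print(tokens.count('aa bb'))     # 0

# str.count counts raw, non-overlapping substrings
print(sentence.count('aa'))      # 4
print(sentence.count('aa bb'))   # 1
```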

You could use a pattern with an alternation that matches either aa bb or aa, surrounded by word boundaries \b so that a match cannot be part of a longer word. The longer alternative aa bb has to come first in the alternation, otherwise aa would match first and aa bb would never be found.

To reference the groups in code, you can name them entity_1 and entity_2 instead of using group numbers.

Loop over the results, for example with re.finditer, and increment the corresponding counter when a group is not None.

Pattern

\b(?:(?P<entity_2>aa bb)|(?P<entity_1>aa))\b

Explanation

  • \b Word boundary
  • (?: Non-capturing group with alternatives
    • (?P<entity_2>aa bb) Named group entity_2, matches aa bb
    • | Or
    • (?P<entity_1>aa) Named group entity_1, matches aa
  • ) Close the non-capturing group
  • \b Word boundary
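
The order of the alternatives matters: the regex engine takes the first alternative that leads to a match, so the longer entity has to be listed before the shorter one. A minimal sketch comparing both orders on the sample sentence (the variable names here are only illustrative):

```python
import re

sentence = 'aa bb aa, aabb aa.'

# Longer alternative first: 'aa bb' is matched as a unit
longer_first = re.compile(r"\b(?:aa bb|aa)\b")
# Shorter alternative first: 'aa' always wins, 'aa bb' is never matched
shorter_first = re.compile(r"\b(?:aa|aa bb)\b")

longer_first_matches = longer_first.findall(sentence)
shorter_first_matches = shorter_first.findall(sentence)

print(longer_first_matches)    # ['aa bb', 'aa', 'aa']
print(shorter_first_matches)   # ['aa', 'aa', 'aa']
```

In both cases the word boundaries keep the aa inside aabb from matching.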

Example code

import re

# entity_2 ('aa bb') is listed first so the longer alternative wins over 'aa'
regex = r"\b(?:(?P<entity_2>aa bb)|(?P<entity_1>aa))\b"
test_str = "aa bb aa, aabb aa"
entity_1_count = 0
entity_2_count = 0

for match in re.finditer(regex, test_str):
    # Exactly one of the two named groups is set for each match
    if match.group("entity_1") is not None:
        entity_1_count += 1
    elif match.group("entity_2") is not None:
        entity_2_count += 1

print(entity_1_count)
print(entity_2_count)

Output

2
1
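
If there are more than two entities, the same idea can be generalized by building the alternation from a list. The sketch below is one possible way to do that, not part of the answer above: count_entities is a hypothetical helper name, entities are sorted longest first so that longer ones take precedence, and re.escape protects any regex metacharacters. It assumes every entity starts and ends with a word character, since \b would otherwise not behave as intended.

```python
import re

def count_entities(text, entities):
    """Count whole-word occurrences of each entity, longest entity first."""
    # Longest first so that e.g. 'aa bb' is tried before 'aa'
    ordered = sorted(entities, key=len, reverse=True)
    alternation = "|".join(re.escape(e) for e in ordered)
    pattern = re.compile(r"\b(?:" + alternation + r")\b")

    counts = {e: 0 for e in entities}
    for match in pattern.finditer(text):
        # The matched text is exactly one of the entities
        counts[match.group(0)] += 1
    return counts

result = count_entities('aa bb aa, aabb aa.', ['aa', 'aa bb'])
print(result)   # {'aa': 2, 'aa bb': 1}
```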


 