I have the following string:
sentence = 'aa bb aa, aabb aa.'
I'm looking for a way to count the following entities:
entity_1 = 'aa'
entity_2 = 'aa bb'
The end result should be this:
entity_1 count = 2 (only the 'aa' in 'aa,' and 'aa.'; the 'aa' in 'aa bb' should not count, because 'aa bb' is its own entity)
entity_2 count = 1 (only 'aa bb')
I've tried using sentence.split(" ").count(entity) and sentence.count(entity), but both give the wrong counts.
Any ideas?
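You could use a pattern with an alternation matching either aa bb or aa, surrounded by word boundaries \b to prevent the match from being part of a longer word. List aa bb first: the regex engine tries the alternatives from left to right, so the longer entity has to come before the shorter one it contains.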
To reference the groups in code, you might name them entity_1 and entity_2 instead of using the group numbers.
Loop over the results using, for example, re.finditer, and increment a counter if the corresponding group is not None.
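To see what the alternation and the word boundaries do before adding named groups, here is a minimal sketch on the sentence from the question. It uses the unnamed form of the pattern, and the variable names naive and matches are only for illustration.

```python
import re

sentence = "aa bb aa, aabb aa."

# A plain substring count also hits the 'aa' inside 'aa bb' and 'aabb'.
naive = sentence.count("aa")

# Longer alternative first, wrapped in word boundaries.
matches = re.findall(r"\b(?:aa bb|aa)\b", sentence)

print(naive)    # 4
print(matches)  # ['aa bb', 'aa', 'aa']
```

The 'aa' in 'aabb' is skipped because no word boundary follows it, and the first 'aa' is consumed as part of 'aa bb'.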
Pattern
\b(?:(?P<entity_2>aa bb)|(?P<entity_1>aa))\b
Explanation
\b  Word boundary
(?:  Non-capturing group with alternatives
(?P<entity_2>aa bb)  Named group entity_2, matches aa bb
|  Or
(?P<entity_1>aa)  Named group entity_1, matches aa
)  Close the non-capturing group
\b  Word boundary
Example code
import re

# The longer entity comes first so that 'aa bb' is not matched as a lone 'aa'
regex = r"\b(?:(?P<entity_2>aa bb)|(?P<entity_1>aa))\b"
test_str = "aa bb aa, aabb aa"
entity_1_count = 0
entity_2_count = 0

for match in re.finditer(regex, test_str):
    # Only the group that took part in the match is not None
    if match.group("entity_1") is not None:
        entity_1_count += 1
    if match.group("entity_2") is not None:
        entity_2_count += 1

print(entity_1_count)
print(entity_2_count)
Output
2
1
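If you have more than two entities, the same idea could be generalized. The following is only a sketch, not part of the pattern above: count_entities is a hypothetical helper name, and it assumes every entity starts and ends with a word character (otherwise \b will not behave as intended). It sorts the entities longest first, so an entity that contains another one is tried before it, and escapes them with re.escape.

```python
import re
from collections import Counter

def count_entities(text, entities):
    """Count non-overlapping occurrences of each entity, longest match first."""
    # Try longer entities first so 'aa bb' wins over 'aa' at the same position.
    ordered = sorted(entities, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(e) for e in ordered) + r")\b"
    found = Counter(m.group(0) for m in re.finditer(pattern, text))
    # Report every requested entity, including those with zero matches.
    return {e: found.get(e, 0) for e in entities}

counts = count_entities("aa bb aa, aabb aa.", ["aa", "aa bb"])
print(counts)  # {'aa': 2, 'aa bb': 1}
```

Keying the result on the matched text avoids having to invent a valid group name for each entity.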