How to count occurrences of a word followed by a special character in a text using a Python regular expression
I want to count the number of occurrences of the word 'people' in a text using Python. For that I use Counter and Python's regular expressions:
import re
from collections import Counter

for j in range(len(paragraphs)):
    text = paragraphs[j].text
    count[j] = Counter(re.findall(r'\bpeople\b', text))
Yet my code does not take into account occurrences such as "people.", "people!" and "people?". How can I modify it to also count the cases where the word is followed by a specific character?
Thank you for your help,
You can use an optional character group in your regex:
r'\bpeople\b[.,!?]?'
The ? specifies that it can occur 0 or 1 times; the [] specifies which characters are allowed. There is no need to escape the . (or, for example, ()*+?) inside [], although they have special meaning for regex. If you wanted to use a - inside [], you would need to escape it, as it is used to denote ranges in sets: [1-5] == 12345.
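As a minimal sketch of how the optional character group behaves with Counter, the snippet below runs the pattern on a made-up sample string (the text is an assumption for illustration, not the asker's data). The word boundary is placed directly after the word so that the punctuation can be captured.

```python
import re
from collections import Counter

# Hypothetical sample text, used only for illustration.
text = "Some people like people. Other people? People watch people!"

# \b sits right after the word; the optional set then captures one
# trailing punctuation character if there is one.
pattern = re.compile(r'\bpeople\b[.,!?]?')

counts = Counter(pattern.findall(text))
total = sum(counts.values())

print(counts)  # one entry each for 'people', 'people.', 'people?', 'people!'
print(total)   # 4 (the capitalised "People" is not matched)
```

If the capitalised form should count as well, passing re.IGNORECASE to re.compile is one option.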
See: https://docs.python.org/3/library/re.html#regular-expression-syntax
[] Used to indicate a set of characters. In a set:
Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.
Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digit numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. [...]
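The set rules quoted above can be checked quickly in a Python shell. This is only an illustrative sketch; the input strings are made up.

```python
import re

# [amk] matches exactly one of the listed characters.
listed = re.findall(r'[amk]', 'make')

# [0-5][0-9] matches two-digit numbers from 00 to 59.
is_59 = re.fullmatch(r'[0-5][0-9]', '59') is not None
is_60 = re.fullmatch(r'[0-5][0-9]', '60') is not None

# An unescaped '-' between two characters is a range; escaped, it is literal.
as_range = re.findall(r'[1-5]', '1-2-5')
as_literal = re.findall(r'[1\-5]', '1-2-5')

# '.' needs no escaping inside a set: it only matches a literal dot there.
dots = re.findall(r'[.]', 'a.b')

print(listed)      # ['m', 'a', 'k']
print(is_59)       # True
print(is_60)       # False
print(as_range)    # ['1', '2', '5']
print(as_literal)  # ['1', '-', '-', '5']
print(dots)        # ['.']
```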
people[?.!]
This will only match "people?", "people." and/or "people!".
So if you add a few more Counter(re.findall(...)) calls, you will be able to do something like this:
# This will only match "people" followed by whitespace
count[j] = Counter(re.findall(r'people\s', text))
# This will only match people? (+= adds to the existing counts instead of overwriting them)
count[j] += Counter(re.findall(r'people\?', text))
# This will only match people.
count[j] += Counter(re.findall(r'people\.', text))
# This will only match people!
count[j] += Counter(re.findall(r'people\!', text))
You need to use a backslash (\) to escape the special characters.
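An alternative to running one findall per character is a single pass with a capture group, so each match is keyed by the punctuation that follows the word. This is a sketch on a made-up string, not the answerer's code; re.escape is shown as one way to avoid escaping by hand.

```python
import re
from collections import Counter

# Hypothetical sample text, used only for illustration.
text = "people people? people. people! people?"

# With one capture group, findall returns only the captured part:
# the trailing punctuation, or '' when the word stands alone.
trailing = Counter(re.findall(r'\bpeople\b([.?!]?)', text))

# re.escape adds the backslashes needed for a literal match.
escaped = re.escape('people?')
question_marks = len(re.findall(escaped, text))

print(trailing)        # '?' twice; '', '.', and '!' once each
print(escaped)         # people\?
print(question_marks)  # 2
```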
Also, this is a good resource when you are experimenting with Python regular expressions: https://pythex.org/ The site also has a regular expression cheat sheet.
You can use a modifier statement at the end of the 'people' part of your regex pattern. Try the following:
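for j in range(len(paragraphs)):
    text = paragraphs[j].text
    # r prefix goes outside the quotes; \b comes before the optional punctuation so it can be captured
    count[j] = Counter(re.findall(r'\bpeople\b[.?!]?', text))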
The ? is the zero-or-one quantifier. The above pattern seems to work on regex101.com, but I haven't tried it out in a Python shell yet.
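One detail worth checking in a Python shell is where the trailing \b sits relative to the optional punctuation. The sketch below compares both placements on a made-up string; the variable names are illustrative only.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Many people! Few people."

# \b after the optional set: "!" followed by a space (or "." at the end
# of the string) is not a word boundary, so the engine backtracks and
# matches the bare word without the punctuation.
boundary_last = re.findall(r'\bpeople[.?!]?\b', text)

# \b directly after the word, then the optional punctuation.
boundary_first = re.findall(r'\bpeople\b[.?!]?', text)

print(boundary_last)   # ['people', 'people']
print(boundary_first)  # ['people!', 'people.']
```

Both forms give the same total count; they differ only in whether the punctuation ends up in the matched string, which matters when the matches are used as Counter keys.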
Does it have to use regex? Why not just:
len(text.split("people"))-1
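A caveat with the split approach is that it counts substrings rather than whole words. The sketch below compares it with str.count and a word-boundary regex on a made-up string, so the difference is visible.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "people, peoples and townspeople"

# Splitting on the substring yields one more piece than there are matches.
by_split = len(text.split("people")) - 1

# str.count gives the same substring-based number more directly.
by_count = text.count("people")

# \b restricts the match to the standalone word.
whole_words = len(re.findall(r'\bpeople\b', text))

print(by_split)     # 3
print(by_count)     # 3
print(whole_words)  # 1
```

So the split version is fine when any occurrence of the substring should count, and the regex is the safer choice when only the whole word should.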