Using Python and regex I am trying to find words in a piece of text that start with a capital letter but are not at the start of a sentence.
The best way I can think of is to check that the word is not preceded by a full stop and then a space. I am pretty sure that I need to use a negative lookbehind. This is what I have so far; it runs but always returns nothing:
(?<!\.\s)\b[A-Z][a-z]*\b
I think the problem might be with the use of [A-Z][a-z]* between the word boundaries \b, but I am really not sure.
Thanks for the help.
Your regex appears to work:
In [6]: import re
In [7]: re.findall(r'(?<!\.\s)\b[A-Z][a-z]*\b', 'lookbehind. This is what I have')
Out[7]: ['I']
Make sure you're using a raw string (r'...') when specifying the regex.
If you have some specific inputs on which the regex doesn't work, please add them to your question.
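To illustrate why the raw string matters, here is a minimal sketch using the same sample text as above. The variable names are just for illustration. In an ordinary string literal Python turns \b into a backspace character before the regex engine ever sees it, which would explain a pattern that runs but never matches:

```python
import re

text = 'lookbehind. This is what I have'

# Raw string: \b reaches the regex engine as a word-boundary assertion.
raw_pattern = r'(?<!\.\s)\b[A-Z][a-z]*\b'

# Ordinary string: Python converts '\b' into a backspace character (0x08)
# before the regex engine sees it. The other backslashes are doubled here
# so that only \b is affected.
plain_pattern = '(?<!\\.\\s)\b[A-Z][a-z]*\b'

raw_matches = re.findall(raw_pattern, text)
plain_matches = re.findall(plain_pattern, text)

print(raw_matches)    # ['I']
print(plain_matches)  # []
```

The second pattern only matches if the text contains literal backspace characters around the word, so on normal input it finds nothing.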
Although you asked specifically for a regex, it may be interesting to also consider a list comprehension. List comprehensions are sometimes a bit more readable (although in this case, probably at the cost of efficiency). Here's one way to achieve this:
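import string

S = ("T'was brillig, and the slithy Toves were gyring and gimbling in the "
     "Wabe. All mimsy were the Borogoves, and the Mome Raths outgrabe.")
LS = S.split()  # split on any whitespace, so no empty strings appear
# Pair each word with its predecessor; '.' stands in before the first word.
# string.ascii_uppercase is the Python 3 name (string.uppercase in Python 2).
words = [x for (pre, x) in zip(['.'] + LS, LS)
         if x[0] in string.ascii_uppercase and pre[-1] != '.']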
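The same idea can also be written as a small function. This is only a sketch under a few assumptions of mine: the function name is made up, it targets Python 3 (hence string.ascii_uppercase), it also treats '!' and '?' as sentence endings, and it strips surrounding punctuation from the words it returns.

```python
import string

def mid_sentence_capitalized(text):
    """Return capitalized words that do not directly follow a sentence end."""
    result = []
    prev = '.'  # treat the start of the text as a sentence boundary
    for word in text.split():
        # Keep the word if it starts with a capital and the previous
        # word did not end a sentence.
        if word[0] in string.ascii_uppercase and prev[-1] not in '.!?':
            result.append(word.strip(string.punctuation))
        prev = word
    return result

S = ("T'was brillig, and the slithy Toves were gyring and gimbling in the "
     "Wabe. All mimsy were the Borogoves, and the Mome Raths outgrabe.")
found = mid_sentence_capitalized(S)
print(found)  # ['Toves', 'Wabe', 'Borogoves', 'Mome', 'Raths']
```

An explicit loop is longer than the comprehension, but it leaves room for extra rules such as abbreviations or quotation marks.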
Try looping over your input with:
(?!^)\b([A-Z]\w+)
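and capture the first group. As you can see, a negative lookahead can be used as well, since the position you want to match is anywhere except the beginning of a line (without the re.MULTILINE flag, ^ matches only at the very start of the string). A negative lookbehind would have the same effect. Note that \w+ requires at least one more character after the capital, so a single-letter word such as "I" is not matched.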
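For comparison, here is a rough sketch that runs this pattern and the lookbehind pattern from the question on the same input. The sample sentence and the combined pattern are my own additions, not taken from the question or the answers. The combined version is one possible way to exclude both the first word of the text and words that follow a full stop and a space:

```python
import re

text = 'Hello World. This is Python and I like It'

# Pattern from this answer: skips only a word at the very start of the string.
lookahead_only = r'(?!^)\b([A-Z]\w+)'
# Pattern from the question: skips only words preceded by '. '.
lookbehind_only = r'(?<!\.\s)\b[A-Z][a-z]*\b'
# Assumed combination of both checks (not from the original posts).
combined = r'(?<!\.\s)(?!^)\b[A-Z][a-z]*\b'

by_lookahead = re.findall(lookahead_only, text)
by_lookbehind = re.findall(lookbehind_only, text)
by_combined = re.findall(combined, text)

print(by_lookahead)   # ['World', 'This', 'Python', 'It']
print(by_lookbehind)  # ['Hello', 'World', 'Python', 'I', 'It']
print(by_combined)    # ['World', 'Python', 'I', 'It']
```

The lookahead alone still returns 'This', and the lookbehind alone still returns 'Hello', so which one fits depends on what counts as a sentence start in your input.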