Find words with capital letters not at start of a sentence with regex

Question

Using Python and regex I am trying to find words in a piece of text that start with a capital letter but are not at the start of a sentence.

The best way I can think of is to check that the word is not preceded by a full stop and then a space. I am pretty sure that I need to use a negative lookbehind. This is what I have so far; it runs but always returns nothing:

(?<!\.\s)\b[A-Z][a-z]*\b

I think the problem might be with the use of [A-Z][a-z]* inside the word boundaries \b, but I am really not sure.

Thanks for the help.

Answer 1

Your regex appears to work:

In [6]: import re

In [7]: re.findall(r'(?<!\.\s)\b[A-Z][a-z]*\b', 'lookbehind. This is what I have')
Out[7]: ['I']

Make sure you're using a raw string ( r'...' ) when specifying the regex.
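
To illustrate why the raw string matters, here is a minimal sketch comparing the same pattern written both ways. In an ordinary string literal Python converts \b to a backspace character before the regex engine sees it, which may be why the original attempt returned nothing (the question doesn't show how the pattern was passed, so that part is a guess). The variable names are just for illustration.

```python
import re

text = 'lookbehind. This is what I have'

# Raw string: \b reaches the regex engine as a word-boundary assertion.
raw_pattern = r'(?<!\.\s)\b[A-Z][a-z]*\b'

# Ordinary string: Python turns \b into a backspace character (\x08) before
# re ever sees it. The other backslashes are doubled here only to keep this
# example free of invalid-escape warnings.
plain_pattern = '(?<!\\.\\s)\b[A-Z][a-z]*\b'

raw_matches = re.findall(raw_pattern, text)
plain_matches = re.findall(plain_pattern, text)

print(raw_matches)    # ['I']
print(plain_matches)  # []
```

The second pattern only matches if the text contains a literal backspace right before the capital letter, so on normal text it finds nothing.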

If you have some specific inputs on which the regex doesn't work, please add them to your question.
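
For what it's worth, two kinds of input where the pattern may not do what you want: a capitalized word at the very start of the text (nothing precedes it, so the lookbehind passes), and sentences that end in ? or !. Below is a rough sketch of one possible adjustment, assuming sentences end in ., ? or ! followed by exactly one whitespace character. The (?<!^) lookbehind and the [.!?] class are my additions, not part of the original pattern.

```python
import re

text = 'This is what I have. Really? Yes! Ask Bob in Paris.'

original = r'(?<!\.\s)\b[A-Z][a-z]*\b'
# Assumed extension: (?<!^) rejects a match at the start of the text,
# and [.!?] treats ? and ! as sentence terminators too.
adjusted = r'(?<!^)(?<![.!?]\s)\b[A-Z][a-z]*\b'

original_matches = re.findall(original, text)
adjusted_matches = re.findall(adjusted, text)

print(original_matches)  # ['This', 'I', 'Yes', 'Ask', 'Bob', 'Paris']
print(adjusted_matches)  # ['I', 'Bob', 'Paris']
```

Neither version copes with abbreviations such as "Dr. Smith" or with more than one space after the full stop, since a lookbehind in Python's re module has to be fixed-width.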

Answer 2

Although you asked specifically for a regex, it may be interesting to also consider a list comprehension. They're sometimes a bit more readable (although in this case, probably at the cost of efficiency). Here's one way to achieve this:

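import string

S = ("T'was brillig, and the slithy Toves were gyring and gimbling in the "
     "Wabe. All mimsy were the Borogoves, and the Mome Raths outgrabe.")

# split() with no argument collapses runs of whitespace, so no word is empty
LS = S.split()

# Pair each word with the one before it; the leading '.' marks the first
# word as a sentence start. string.ascii_uppercase is used because
# string.uppercase exists only in Python 2.
words = [x for (pre, x) in zip(['.'] + LS, LS)
    if x[0] in string.ascii_uppercase and pre[-1] != '.']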
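
One thing to watch with the split-based approach is that punctuation stays attached to the words ("Wabe." rather than "Wabe"). A small sketch of the same idea written as a loop, with the punctuation trimmed; the function name capitalized_mid_sentence is hypothetical, and only a full stop is treated as ending a sentence, as in the comprehension above.

```python
import string


def capitalized_mid_sentence(text):
    """Return capitalized words that do not directly follow a full stop."""
    result = []
    prev = '.'  # treat the first word as a sentence start
    for word in text.split():
        if word[0].isupper() and not prev.endswith('.'):
            # Trim leading/trailing punctuation such as ',' or '.'
            result.append(word.strip(string.punctuation))
        prev = word
    return result


S = ("T'was brillig, and the slithy Toves were gyring and gimbling in the "
     "Wabe. All mimsy were the Borogoves, and the Mome Raths outgrabe.")

print(capitalized_mid_sentence(S))
# ['Toves', 'Wabe', 'Borogoves', 'Mome', 'Raths']
```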

Answer 3

Try looping over your input with:

(?!^)\b([A-Z]\w+)

and capture the first group. As you can see, a negative lookahead can be used as well, since the position you want to match is anywhere except the beginning of a line. A negative lookbehind would have the same effect.
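
A quick sketch of how this pattern behaves in Python, in case it helps when comparing it with the lookbehind version. As far as I can tell, (?!^) only rules out the very start of the string (or the start of each line under re.MULTILINE), so words that begin a later sentence on the same line are still returned, and \w+ needs at least two characters, so a lone "I" is skipped. The sample text is made up for illustration.

```python
import re

text = 'This is what I have. Then Bob left.'

pattern = re.compile(r'(?!^)\b([A-Z]\w+)')

# findall returns the contents of the single capture group
matches = pattern.findall(text)

print(matches)  # ['Then', 'Bob']
```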

3 answers

solution1
2 ACCEPTED 2012-01-05 16:20:49

solution2
1 2012-01-05 16:45:34

solution3
0 2012-01-05 16:21:03
