Find words with capital letters not at start of a sentence with regex

Using Python and regex I am trying to find words in a piece of text that start with a capital letter but are not at the start of a sentence.

The best way I can think of is to check that the word is not preceded by a full stop followed by a space. I am pretty sure that I need to use a negative lookbehind. This is what I have so far; it runs but always returns nothing:

(?<!\.\s)\b[A-Z][a-z]*\b

I think the problem might be with the use of [A-Z][a-z]* inside the word boundaries \b, but I am really not sure.

Thanks for the help.

Your regex appears to work:

In [6]: import re

In [7]: re.findall(r'(?<!\.\s)\b[A-Z][a-z]*\b', 'lookbehind. This is what I have')
Out[7]: ['I']

Make sure you're using a raw string ( r'...' ) when specifying the regex.
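One plausible cause of the empty result, assuming the pattern was written as an ordinary string literal: without the r prefix, Python turns \b into a backspace character before the regex engine ever sees it, so the pattern looks for literal backspaces. A minimal sketch of the difference (the variable names are only for illustration):

```python
import re

text = 'lookbehind. This is what I have'

# Raw string: \b reaches the regex engine as a word boundary.
raw_pattern = r'(?<!\.\s)\b[A-Z][a-z]*\b'

# Ordinary string: '\b' is converted to a backspace character (\x08).
# The other backslashes are doubled here only to avoid invalid-escape warnings.
plain_pattern = '(?<!\\.\\s)\b[A-Z][a-z]*\b'

raw_matches = re.findall(raw_pattern, text)
plain_matches = re.findall(plain_pattern, text)

print(raw_matches)    # ['I']
print(plain_matches)  # []
```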

If you have some specific inputs on which the regex doesn't work, please add them to your question.
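Two kinds of input where the pattern may not do what you intend are the very first word of the text and sentences ending in ! or ?. The sketch below compares your pattern with a hypothetical stricter variant; the sample text and the names original and stricter are assumptions made for illustration, not part of the question.

```python
import re

text = 'The cat met Alice. Then Bob left! Was Carol there?'

# Pattern from the question: only ". " marks a sentence boundary.
original = r'(?<!\.\s)\b[A-Z][a-z]*\b'

# Hypothetical variant: also treat "! " and "? " as sentence ends,
# and reject a match at the very start of the text.
stricter = r'(?<![.!?]\s)(?<!^)\b[A-Z][a-z]*\b'

original_matches = re.findall(original, text)
stricter_matches = re.findall(stricter, text)

print(original_matches)  # ['The', 'Alice', 'Bob', 'Was', 'Carol']
print(stricter_matches)  # ['Alice', 'Bob', 'Carol']
```

Both lookbehinds are fixed-width, which Python's re module requires.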

Although you asked specifically for a regex, it may be interesting to also consider a list comprehension. They're sometimes a bit more readable (although in this case, probably at the cost of efficiency). Here's one way to achieve this:

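import string

S = ("T'was brillig, and the slithy Toves were gyring and gimbling in the "
     "Wabe. All mimsy were the Borogoves, and the Mome Raths outgrabe.")

# split() with no argument also copes with repeated whitespace
LS = S.split()

# Pair each word with the word before it; the leading '.' marks the
# first word as sentence-initial so that it is skipped.
words = [x for (pre, x) in zip(['.'] + LS, LS)
         if x[0] in string.ascii_uppercase and pre[-1] != '.']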
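One side effect of splitting on whitespace is that punctuation stays attached to the words (for example 'Wabe.'). A possible refinement, sketched here as a function, strips surrounding punctuation from each hit and also treats ! and ? as sentence ends. The function name capitalized_mid_sentence is made up for this example.

```python
import string

def capitalized_mid_sentence(text):
    """Return capitalized words that do not follow a sentence end."""
    words = text.split()
    result = []
    # The leading '.' makes the first word count as sentence-initial.
    for pre, word in zip(['.'] + words, words):
        if word[0] in string.ascii_uppercase and pre[-1] not in '.!?':
            # Drop attached punctuation such as a trailing ',' or '.'
            result.append(word.strip(string.punctuation))
    return result

S = ("T'was brillig, and the slithy Toves were gyring and gimbling in the "
     "Wabe. All mimsy were the Borogoves, and the Mome Raths outgrabe.")

print(capitalized_mid_sentence(S))
# ['Toves', 'Wabe', 'Borogoves', 'Mome', 'Raths']
```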

Try looping over your input with:

(?!^)\b([A-Z]\w+)

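and capture the first group. As you can see, a negative lookahead can be used as well, since the position you want to match is anywhere except the beginning of a line. A negative lookbehind, (?<!^), would have the same effect. Note that ^ only matches at the start of the whole input unless re.MULTILINE is set, so a capitalized word that opens a later sentence on the same line is still matched, and that [A-Z]\w+ needs at least two characters, so a lone "I" is skipped.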
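A short sketch of that loop using re.finditer, with made-up sample strings. It also shows how re.MULTILINE changes what ^ excludes.

```python
import re

pattern = re.compile(r'(?!^)\b([A-Z]\w+)')
text = 'Using Python and regex. This is what I have'

# Loop over the matches and keep the first capture group.
found = [m.group(1) for m in pattern.finditer(text)]
print(found)  # ['Python', 'This']

# Without re.MULTILINE, ^ matches only at the start of the string,
# so the first word of the second line is still reported.
two_lines = 'Hello World\nGoodbye Moon'
single = [m.group(1) for m in pattern.finditer(two_lines)]
print(single)  # ['World', 'Goodbye', 'Moon']

# With re.MULTILINE, ^ also matches after each newline.
multi_pattern = re.compile(r'(?!^)\b([A-Z]\w+)', re.MULTILINE)
multi = [m.group(1) for m in multi_pattern.finditer(two_lines)]
print(multi)  # ['World', 'Moon']
```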
