Extract words that begin with capital letters

Question

I have a string like this:

text1="sedentary. Allan Takocok. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I want to extract the words in this text that begin with a capital letter but do not follow a full stop. So [Takocok The New England Journal of Medicine] should be extracted, but not [That's Allan].

I tried this regex, but it still extracts Allan and That's:

t = re.findall(r"((?:[A-Z]\w+[ -]?)+)", text1)

Answer 1

Here is an option using re.findall:

import re

text1 = "sedentary. Allan Takocok. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."
matches = re.findall(r'(?:(?<=^)|(?<=[^.]))\s+([A-Z][a-z]+)', text1)
print(matches)

This prints:

['Takocok', 'The', 'New', 'England', 'Journal', 'Medicine']

Here is an explanation of the regex pattern:

(?:(?<=^)|(?<=[^.]))   assert that what precedes is either the start of the string,
                       or a non full stop character
\s+                    then match (but do not capture) one or more spaces
([A-Z][a-z]+)          then match AND capture a word starting with a capital letter
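
The same pattern can also be written with re.VERBOSE, so that each part carries its explanation as a comment. This is only a sketch of the pattern above; the name PATTERN is chosen here for illustration and is not part of the original answer.

```python
import re

text1 = ("sedentary. Allan Takocok. That's the conclusion of two studies "
         "published in this week's issue of The New England Journal of Medicine.")

# Same pattern as above, spread over several lines with re.VERBOSE.
# In verbose mode, whitespace in the pattern is ignored and '#' starts a comment.
PATTERN = re.compile(r"""
    (?:(?<=^)|(?<=[^.]))   # preceded by the start of the string or by a character that is not a full stop
    \s+                    # one or more whitespace characters (matched, not captured)
    ([A-Z][a-z]+)          # capture a capital letter followed by lowercase letters
""", re.VERBOSE)

matches = PATTERN.findall(text1)
print(matches)
```

Because the pattern requires whitespace before the captured word, a capitalized word at the very beginning of the string is not matched unless the string starts with whitespace.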

Answer 2

This should be the regex you're looking for:

(?<!\.)\s+([A-Z][A-Za-z]+)

See the regex101 here: https://regex101.com/r/EoPqgw/1
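
As a rough illustration, this is how the pattern might be used from Python, next to the pattern from Answer 1. The variable names and the second sample sentence are made up for this sketch. The visible difference is the character class after the capital letter: [a-z]+ stops at the next uppercase letter, while [A-Za-z]+ keeps going.

```python
import re

text1 = ("sedentary. Allan Takocok. That's the conclusion of two studies "
         "published in this week's issue of The New England Journal of Medicine.")

answer1 = r'(?:(?<=^)|(?<=[^.]))\s+([A-Z][a-z]+)'  # pattern from Answer 1
answer2 = r'(?<!\.)\s+([A-Z][A-Za-z]+)'            # pattern from this answer

result = re.findall(answer2, text1)
print(result)
# ['Takocok', 'The', 'New', 'England', 'Journal', 'Medicine']

# Hypothetical sample with a mixed-case word, to show where the two differ.
sample = "we met McDonald today."
mixed1 = re.findall(answer1, sample)
mixed2 = re.findall(answer2, sample)
print(mixed1)  # ['Mc']
print(mixed2)  # ['McDonald']
```

Both patterns only look at the single character directly before the whitespace, so a word that follows a full stop and two or more spaces would likely still be matched.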

Answer 3

It's probably possible to find a single regular expression for this case, but it tends to get messy.

Instead, I suggest a two-step approach:

1. Split the text into tokens.
2. Work on these tokens to extract the interesting words.

For the example text, the token list would start like this:

tokens = [
    'sedentary',
    '.',
    ' ',
    'Allan',
    ' ',
    'Takocok',
    '.',
    ' ',
    'That\'s',
    …
]

This token splitting is already complicated enough.

Using this list of tokens, it is easier to express the actual requirements since you now work on well-defined tokens instead of arbitrary character sequences.

I kept the spaces in the token list because you might want to distinguish the dots in 'a.dotted.brand.name' or 'www.example.org' from the dot at the end of a sentence.

Using this token list, it is easier than before to express rules like "must be preceded immediately by a dot".

I expect that your rules will get quite complicated over time, since you are dealing with natural language text. That is why I suggest the abstraction to tokens.
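
The two steps could be sketched roughly as follows. This is a minimal sketch, not a full tokenizer: the functions tokenize and words_not_after_full_stop, the token regex, and the decision to treat the first word of the text as a sentence start are all assumptions made for this example.

```python
import re

def tokenize(text):
    # Assumed token classes: runs of letters/apostrophes (words),
    # runs of whitespace, and any other single character (e.g. '.').
    return re.findall(r"[A-Za-z']+|\s+|[^A-Za-z'\s]", text)

def words_not_after_full_stop(tokens):
    words = []
    for i, tok in enumerate(tokens):
        if not tok[0].isupper():
            continue
        # Walk back over whitespace tokens to find the previous visible token.
        j = i - 1
        while j >= 0 and tokens[j].isspace():
            j -= 1
        saw_space = j < i - 1
        # Assumption: a word starts a sentence if it is the first token,
        # or if it follows a full stop and at least one whitespace token.
        starts_sentence = j < 0 or (tokens[j] == "." and saw_space)
        if not starts_sentence:
            words.append(tok)
    return words

text1 = ("sedentary. Allan Takocok. That's the conclusion of two studies "
         "published in this week's issue of The New England Journal of Medicine.")

tokens = tokenize(text1)
words = words_not_after_full_stop(tokens)
print(words)
# ['Takocok', 'The', 'New', 'England', 'Journal', 'Medicine']
```

Because whitespace is kept as its own token, the rule can tell a full stop followed by a space (end of a sentence) from a dot inside a name such as 'www.example.org', and it can be extended without touching the tokenizer.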

3 answers

solution1
2 ACCEPTED 2019-07-30 04:57:02

solution2
1 2019-07-30 04:56:11

solution3
1 2019-07-30 05:00:51