
Extract words that begin with capital letters

I have a string like this:

text1="sedentary. Allan Takocok. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."

I want to extract the words in this text that begin with a capital letter but do not follow a full stop. So [Takocok The New England Journal of Medicine] should be extracted, but not [That's Allan].

I tried this regex, but it still extracts Allan and That's:

t = re.findall(r"((?:[A-Z]\w+[ -]?)+)", text1)

Here is an option using re.findall:

import re

text1 = "sedentary. Allan Takocok. That's the conclusion of two studies published in this week's issue of The New England Journal of Medicine."
matches = re.findall(r'(?:(?<=^)|(?<=[^.]))\s+([A-Z][a-z]+)', text1)
print(matches)

This prints:

['Takocok', 'The', 'New', 'England', 'Journal', 'Medicine']

Here is an explanation of the regex pattern:

(?:(?<=^)|(?<=[^.]))   assert that what precedes is either the start of the string,
                       or a non full stop character
\s+                    then match (but do not capture) one or more whitespace characters
([A-Z][a-z]+)          then match AND capture a word starting with a capital letter

This should be the regex you're looking for:

(?<!\.)\s+([A-Z][A-Za-z]+)

See the regex101 here: https://regex101.com/r/EoPqgw/1
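To try this pattern outside regex101, here is a minimal Python sketch. It assumes the same text1 as in the question; the variable names pattern and matches are only illustrative.

```python
import re

text1 = ("sedentary. Allan Takocok. That's the conclusion of two studies "
         "published in this week's issue of The New England Journal of Medicine.")

# (?<!\.)  the whitespace must not be preceded by a full stop
# \s+      one or more whitespace characters (not captured)
# (...)    a word starting with a capital letter (captured)
pattern = re.compile(r"(?<!\.)\s+([A-Z][A-Za-z]+)")

matches = pattern.findall(text1)
print(matches)
# ['Takocok', 'The', 'New', 'England', 'Journal', 'Medicine']
```

One caveat: the lookbehind only checks the single character before the point where the whitespace match starts. If a full stop is followed by two spaces, the match can start at the second space, so the word after it is still captured.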

It's probably possible to find a single regular expression for this case, but it tends to get messy.

Instead, I suggest a two-step approach:

  1. split the text into tokens
  2. work on these tokens to extract the interesting words

After step 1, the token list for your text would start like this:

tokens = [
    'sedentary',
    '.',
    ' ',
    'Allan',
    ' ',
    'Takocok',
    '.',
    ' ',
    'That\'s',
    …
]

Splitting the text into tokens is already complicated enough on its own.

Using this list of tokens, it is easier to express the actual requirements since you now work on well-defined tokens instead of arbitrary character sequences.

I kept the spaces in the token list because you might want to distinguish between 'a.dotted.brand.name' or 'www.example.org' and the dot at the end of a sentence.

Using this token list, it is easier than before to express rules like "must be preceded immediately by a dot".

I expect that your rules will get quite complicated over time, since you are dealing with natural language text. Hence the abstraction to tokens.
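The two steps could be sketched roughly as follows. This is only one possible reading of the approach, not a definitive implementation: the names TOKEN_RE, tokenize and capitalized_words are made up for illustration, and the token pattern (letters and apostrophes form a word, whitespace runs are kept, every other character is its own token) is an assumption that real text will quickly outgrow.

```python
import re

# Assumed token pattern: a word (letters and apostrophes), a run of
# whitespace, or any other single non-space character such as a full stop.
TOKEN_RE = re.compile(r"[A-Za-z']+|\s+|\S")


def tokenize(text):
    """Step 1: split the text into tokens, keeping the whitespace tokens."""
    return TOKEN_RE.findall(text)


def capitalized_words(tokens):
    """Step 2: keep capitalized words that do not follow '.' plus whitespace."""
    words = []
    for i, token in enumerate(tokens):
        if not token[0].isupper():
            continue
        # A word starts a new sentence if the two tokens before it are
        # a full stop followed by whitespace.
        after_full_stop = (
            i >= 2 and tokens[i - 1].isspace() and tokens[i - 2] == "."
        )
        if not after_full_stop:
            words.append(token)
    return words


text1 = ("sedentary. Allan Takocok. That's the conclusion of two studies "
         "published in this week's issue of The New England Journal of Medicine.")

tokens = tokenize(text1)
print(capitalized_words(tokens))
# ['Takocok', 'The', 'New', 'England', 'Journal', 'Medicine']
```

Because the whitespace tokens are kept, the rule "full stop, then whitespace, then word" does not fire inside something like 'www.Example.org', where the dot is followed directly by the next word. In this sketch a capitalized word at the very start of the text is kept; whether that is wanted depends on your requirements.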

