How to find all words with first letter as upper case using Python Regex

I need to find all the words in a file that start with an uppercase letter. I tried the code below, but it finds no matches.

import os
import re

matches = []

filename = 'C://Users/Documents/romeo.txt'
with open(filename, 'r') as f:
    for line in f:
        regex = "^[A-Z]\w*$"
        matches.append(re.findall(regex, line))
print(matches)

File:

Hi, How are You?

Expected output:

[Hi,How,You]

You can use

import re

matches = []
filename = r'C:\Users\Documents\romeo.txt'
with open(filename, 'r') as f:
    for line in f:
        matches.extend([x for x in re.findall(r'\w+', line) if x[0].isupper()])
print(matches)

The idea is to extract all words with a simple \w+ regex and add to the final matches list only those that start with an uppercase letter.

NOTE: If you want to match only words made up of letters, use the r'\b[^\W\d_]+\b' regex.

This approach is Unicode friendly, that is, any word whose first character is an uppercase letter will be found, whatever the script.
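As a rough illustration of the last two points, here is a small sketch on a made-up sample string (the string and the helper name capitalized_words are assumptions for the demo, not part of the question). It compares \w+ with the letters-only pattern, and with an ASCII-only [A-Z] class for contrast:

```python
import re

def capitalized_words(text, pattern=r'\w+'):
    """Return the matches of `pattern` whose first character is uppercase."""
    return [w for w in re.findall(pattern, text) if w[0].isupper()]

sample = "Москва and Paris greet R2D2 at Big_Ben"  # made-up sample text

any_word = capitalized_words(sample)
letters_only = capitalized_words(sample, r'\b[^\W\d_]+\b')
ascii_only = re.findall(r'\b[A-Z]\w*', sample)

print(any_word)      # ['Москва', 'Paris', 'R2D2', 'Big_Ben']
print(letters_only)  # ['Москва', 'Paris']
print(ascii_only)    # ['Paris', 'R2D2', 'Big_Ben']
```

Note that with the letters-only pattern, words containing a digit or an underscore are skipped entirely rather than split, because \b cannot match between two word characters.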

You also ask:

Is there a way to limit this to only words that start with an upper case letter and not all uppercase words

You can extend the previous code to

[x for x in re.findall(r'\w+', line) if x[0].isupper() and not x.isupper()]

With this filter, "Hi, How ARE You?" yields ['Hi', 'How', 'You'].

Or, to avoid getting CaMeL words in the output, use

matches.extend([x for x in re.findall(r'\w+', line) if x[0].isupper() and all(i.islower() for i in x[1:])])

Here, all(i.islower() for i in x[1:]) makes sure every character after the first one is a lowercase letter.
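To compare the three filters side by side, here is a minimal sketch on a made-up line (the sample text and the variable names are only for illustration):

```python
import re

text = "Hi, How ARE You, McDonald?"  # made-up sample line
words = re.findall(r'\w+', text)

# 1. Any word starting with an uppercase letter
starts_upper = [w for w in words if w[0].isupper()]
# 2. Same, but drop ALLCAPS words
not_all_caps = [w for w in words if w[0].isupper() and not w.isupper()]
# 3. Same, but require everything after the first letter to be lowercase
title_only = [w for w in words
              if w[0].isupper() and all(c.islower() for c in w[1:])]

print(starts_upper)  # ['Hi', 'How', 'ARE', 'You', 'McDonald']
print(not_all_caps)  # ['Hi', 'How', 'You', 'McDonald']
print(title_only)    # ['Hi', 'How', 'You']
```

One edge case to be aware of: a one-letter word such as "I" is dropped by the second filter (it is all uppercase) but kept by the third, since all() over an empty sequence is True.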

Fully regex approach

You can use the PyPI regex module, which supports both Unicode property classes (\p{Lu} / \p{Ll}) and POSIX character classes ([:upper:] / [:lower:]). The solution then looks like

import regex
text = "Hi, How ARE You?"
# Word starting with an uppercase letter:
print( regex.findall(r'\b\p{Lu}\p{L}*\b', text) )
## => ['Hi', 'How', 'ARE', 'You']
# Word starting with an uppercase letter but not ALLCAPS:
print( regex.findall(r'\b\p{Lu}\p{Ll}*\b', text) )
## => ['Hi', 'How', 'You']

In these patterns:

  • \b - a word boundary
  • \p{Lu} - any uppercase letter
  • \p{L}* - any zero or more letters
  • \p{Ll}* - any zero or more lowercase letters

You can use a word boundary instead of the anchors ^ and $:

\b[A-Z]\w*
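A quick check of why the anchored pattern from the question finds nothing, using the sample line from the question (the "Romeo" line is a made-up extra case):

```python
import re

line = "Hi, How are You?\n"  # one line as read from the file, newline included

# ^...$ requires the whole line to be a single word, so nothing matches
anchored = re.findall(r"^[A-Z]\w*$", line)
# \b lets a match start at any word inside the line
bounded = re.findall(r"\b[A-Z]\w*", line)
# A line that is exactly one capitalized word does satisfy the anchors
single = re.findall(r"^[A-Z]\w*$", "Romeo\n")

print(anchored)  # []
print(bounded)   # ['Hi', 'How', 'You']
print(single)    # ['Romeo']
```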


Note that matches.append adds a single item to the list, and since re.findall returns a list, you end up with a list of lists. Use += (or matches.extend) to add the individual words instead:

import re

matches = []
regex = r"\b[A-Z]\w*"
filename = r'C:\Users\Documents\romeo.txt'
with open(filename, 'r') as f:
    for line in f:
        matches += re.findall(regex, line)
print(matches)

Output

['Hi', 'How', 'You']
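To make the append versus += difference concrete, here is a small sketch that uses an in-memory list of lines as a stand-in for the file (the lines themselves are made up):

```python
import re

# Stand-in for the lines of the file
lines = ["Hi, How are You?", "no capitals here", "Romeo and Juliet"]
regex = r"\b[A-Z]\w*"

nested, flat = [], []
for line in lines:
    nested.append(re.findall(regex, line))  # adds one sub-list per line
    flat += re.findall(regex, line)         # adds the individual words

print(nested)  # [['Hi', 'How', 'You'], [], ['Romeo', 'Juliet']]
print(flat)    # ['Hi', 'How', 'You', 'Romeo', 'Juliet']
```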

If there should be a whitespace boundary to the left, you could also use

(?<!\S)[A-Z]\w*
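The two boundaries differ when a capitalized word follows punctuation. A short sketch on a made-up string:

```python
import re

text = "Hi, (How) are-You x.Yes No"  # made-up sample

# \b: a word may start right after any non-word character
word_boundary = re.findall(r"\b[A-Z]\w*", text)
# (?<!\S): a word may start only at the start of the string or after whitespace
ws_boundary = re.findall(r"(?<!\S)[A-Z]\w*", text)

print(word_boundary)  # ['Hi', 'How', 'You', 'Yes', 'No']
print(ws_boundary)    # ['Hi', 'No']
```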



If you don't want to match words that consist only of uppercase chars, you could use, for example, a negative lookahead that fails the match when only uppercase chars remain up to the next word boundary:

\b[A-Z](?![A-Z]*\b)\w*
  • \b A word boundary to prevent a partial match
  • [A-Z] Match an uppercase char A-Z
  • (?![A-Z]*\b) Negative lookahead, assert not only uppercase chars followed by a word boundary
  • \w* Match optional word chars
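A minimal check of the lookahead pattern on a made-up sentence:

```python
import re

pattern = r"\b[A-Z](?![A-Z]*\b)\w*"
text = "Hi, How ARE You, NASA and McDonald? I see."  # made-up sample

found = re.findall(pattern, text)
print(found)  # ['Hi', 'How', 'You', 'McDonald']
```

Note that a one-letter word such as "I" is also rejected, because the lookahead sees zero uppercase chars followed directly by a word boundary.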



To match a word that starts with an uppercase char, and does not contain any more uppercase chars:

\b[A-Z][^\WA-Z]*\b
  • \b A word boundary
  • [A-Z] Match an uppercase char A-Z
  • [^\WA-Z]* Optionally match word chars except A-Z
  • \b A word boundary
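A minimal check of this pattern on a made-up sentence:

```python
import re

pattern = r"\b[A-Z][^\WA-Z]*\b"
text = "Hi, How ARE You, McDonald? I saw R2 and Big_ben."  # made-up sample

found = re.findall(pattern, text)
print(found)  # ['Hi', 'How', 'You', 'I', 'R2', 'Big_ben']
```

Since [^\WA-Z] is "any word char except A-Z", digits and underscores are still allowed after the first letter, and a one-letter word such as "I" matches.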

