
Regex python find uppercase names

I have a text file of the type:

[...speech...]

NAME_OF_SPEAKER_1: [...speech...]

NAME_OF_SPEAKER_2: [...speech...]

My aim is to isolate the speeches of the various speakers. They are clearly identified because the name of each speaker is always written in uppercase letters (name + surname). However, the speeches can contain nouns (not people's names) that are also in uppercase, but only one of those words is long enough to cause a problem (it has four letters; say it is 'ABCD'). I was thinking of identifying the position of each speaker's name (I assume every name is at least 3 letters long) with something like

re.search('[A-Z^(ABCD)]{3,}',text_to_search)

in order to exclude that specific (constant) word 'ABCD'. However, the command identifies that word instead of excluding it. Any ideas about how to overcome this problem?

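Square brackets [] define a character class, which matches a single character only. Round brackets () inside square brackets are not a group; they are literal characters in the class. That means:

[ABCD] is the same as [A-D], and [(ABCD)] is the same as [A-D()].

[^(ABCD)] matches any single character that is not one of A-D, ( or ). Note that ^ negates a class only when it is the first character after [; in [A-Z^(ABCD)] it is just a literal caret.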
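To make the point about character classes concrete, here is a small sketch that runs the pattern from the question against a sample string. The sample text is made up for illustration; only the pattern comes from the question.

```python
import re

# Pattern from the question: a character class containing A-Z,
# a literal ^, and literal ( and ). It is not "A-Z except ABCD".
pattern = re.compile(r'[A-Z^(ABCD)]{3,}')

# Hypothetical sample text, made up for illustration.
sample = "the ABCD report (XY) by JOHN"

matches = pattern.findall(sample)
print(matches)  # ['ABCD', '(XY)', 'JOHN']
```

'ABCD' is matched because every one of its letters is in A-Z, and '(XY)' is matched because the parentheses are ordinary members of the class.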

I would try something different:

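^[A-Z]*?: matches each word written in capital letters that starts at the beginning of a line and is followed by a colon. In Python, pass re.MULTILINE so that ^ matches at the start of every line and not only at the start of the text.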
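A minimal sketch of the line-anchored idea, under two assumptions that are mine and not part of the answer: every speaker label starts its own line, and a space is allowed inside the class so that "name + surname" labels match. The transcript and the names in it are invented for illustration.

```python
import re

# Hypothetical transcript, made up for illustration.
text = """Opening remarks about ABCD.
JOHN SMITH: Thank you. The ABCD figures are in.
MARY JONES: I agree with that."""

# re.MULTILINE makes ^ match at the start of every line.
# Assumption: a label is at least 3 uppercase letters/spaces, then a colon.
speaker_re = re.compile(r'^([A-Z][A-Z ]{2,}):', re.MULTILINE)

speeches = []
matches = list(speaker_re.finditer(text))
for i, m in enumerate(matches):
    # A speech runs from the end of its label to the start of the next label.
    end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
    speeches.append((m.group(1), text[m.end():end].strip()))

print(speeches)
```

Because the match is anchored to the start of a line and requires a trailing colon, an uppercase word such as 'ABCD' in the middle of a speech is never mistaken for a speaker.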

In the pattern that you tried, you get partial matches because there are no boundaries, and [A-Z^(ABCD)]{3,} matches 3 or more of any of the listed characters.

A-Z already includes A, B, C and D, so the pattern could also be written as [A-Z^)(]{3,}

Instead of using a negated character class, you could assert that the word, which consists only of the uppercase chars A-Z, does not contain ABCD by using a negative lookahead (?!...):

\b(?![A-Z]*ABCD)[A-Z]{3,}\b
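As a quick check of this pattern, the sketch below applies it to an invented sentence. The sample words ('JOHN', 'ANNA', 'XABCDY') are placeholders chosen for illustration.

```python
import re

# At a word boundary, refuse to match if ABCD occurs anywhere in the
# run of uppercase letters that follows; otherwise take 3+ uppercase letters.
pattern = re.compile(r'\b(?![A-Z]*ABCD)[A-Z]{3,}\b')

# Hypothetical sample text, made up for illustration.
sample = "The ABCD report by JOHN and XABCDY, said ANNA"

names = pattern.findall(sample)
print(names)  # ['JOHN', 'ANNA']
```

Note that the lookahead rejects any uppercase word that contains ABCD, such as 'XABCDY', and not only the exact word 'ABCD'.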


If the name should start with 3 uppercase chars and can also contain lowercase chars, underscores or digits, you could add \w* after matching the 3 uppercase chars:

\b(?![A-Z]*ABCD)[A-Z]{3}\w*\b
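The same kind of check for this variant, again on invented input. The tokens below are placeholders and only show which shapes the pattern accepts.

```python
import re

# Three uppercase letters, then any word characters (letters, digits,
# underscore), still rejecting words whose uppercase prefix contains ABCD.
pattern = re.compile(r'\b(?![A-Z]*ABCD)[A-Z]{3}\w*\b')

# Hypothetical sample text, made up for illustration.
sample = "SPEAKER_1 and ABCD_2 and JOHn and AB"

found = pattern.findall(sample)
print(found)  # ['SPEAKER_1', 'JOHn']
```

'ABCD_2' is rejected by the lookahead, and 'AB' is too short to satisfy [A-Z]{3}.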

