
Regex: Separating All Caps from Numbers

I am using python regex to read documents.

I have the following line in many documents:

Dated: February 4, 2011 THE REAL COMPANY, INC

I can use python text search to easily find the lines that have "dated," but I want to pull THE REAL COMPANY, INC from the text without getting the "February 4, 2011" text.

I have tried the following:

[A-Z\s]{3,}.*INC

My understanding of this regex is that it should get me all the capital letters and spaces before INC, but instead it pulls the full line.

This suggests to me I'm fundamentally missing something about how regex works with capital letters. Is there an easy and obvious explanation I'm missing?

What about using:

>>> import re
>>> txt = 'Dated: February 4, 2011 THE REAL COMPANY, INC'

>>> re.findall('([A-Z][A-Z]+)', txt)
['THE', 'REAL', 'COMPANY', 'INC']

Another way, as suggested by @davedwards:

>>> re.findall(r'[A-Z\s]{3,}.*', txt)
[' THE REAL COMPANY, INC']

Explanation:

  • [A-Z\s]{3,} Match a single character from the set [A-Z\s] between 3 and unlimited times, as many times as possible, giving back as needed (greedy)
  • A-Z A single character in the range between A (index 65) and Z (index 90) (case sensitive)
  • \s Matches any whitespace character (equal to [\r\n\t\f\v ])
  • .* Matches any character (except for line terminators) between zero and unlimited times, as many times as possible, giving back as needed (greedy)
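To see why only the company name qualifies on the sample line, here is a small sketch, assuming the line from the question is stored in `txt` (the names `runs` and `rest` are just for illustration):

```python
import re

txt = 'Dated: February 4, 2011 THE REAL COMPANY, INC'

# Only runs of three or more capitals/whitespace qualify,
# so the lone "D" of "Dated" and the " F" of " February" are skipped.
runs = re.findall(r'[A-Z\s]{3,}', txt)
print(runs)  # [' THE REAL COMPANY', ' INC']

# Adding .* extends the first run to the end of the line;
# strip() drops the leading space that the character class picked up.
rest = re.search(r'[A-Z\s]{3,}.*', txt).group().strip()
print(rest)  # THE REAL COMPANY, INC
```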

You could use

^Dated:.*?\s([A-Z ,]{3,})

and then use the first capturing group; a demo is on regex101.com.
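A minimal sketch of reading that first capturing group in Python, assuming the sample line from the question is held in a variable named `txt` (the name `company` is only illustrative):

```python
import re

txt = 'Dated: February 4, 2011 THE REAL COMPANY, INC'

# Lazily skip past the date, then capture a run of capitals, spaces and commas.
match = re.search(r'^Dated:.*?\s([A-Z ,]{3,})', txt)

# Guard against lines that do not match before reading the group.
company = match.group(1).strip() if match else None
print(company)  # THE REAL COMPANY, INC
```

Note that the capture stops at the first character outside [A-Z ,], so a company name containing digits, lowercase letters or a trailing period would be cut short.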

Your regex [A-Z\s]{3,}.*INC matches an uppercase or whitespace character 3 or more times, followed by any character 0 or more times, and then INC, which will match: THE REAL COMPANY, INC

What you could also do is match Dated: from the start of the string, followed by a date-like format, and then capture what comes after it in a group. Your value will be in the first capturing group:

^Dated:\s+\S+\s+\d{1,2},\s+\d{4}\s+(.*)$

Explanation

  • ^Dated:\s+ Match Dated: at the start of the string, followed by 1+ whitespace characters
  • \S+\s+ Match 1+ non-whitespace characters followed by 1+ whitespace characters, which will match February in this case
  • \d{1,2}, Match 1-2 digits followed by a comma
  • \s+\d{4}\s+ Match 1+ whitespace characters, 4 digits, then 1+ whitespace characters
  • (.*) Capture in a group 0+ times any character
  • $ Assert the end of the string
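The pattern described above could be applied line by line roughly as follows. This is only a sketch: the names `DATED_LINE` and `company_from_line` are made up for illustration, and it assumes every line of interest has the exact "Dated: Month D, YYYY NAME" shape shown in the question.

```python
import re

# Compiled once so it can be reused across many documents.
DATED_LINE = re.compile(r'^Dated:\s+\S+\s+\d{1,2},\s+\d{4}\s+(.*)$')

def company_from_line(line):
    """Return the text after the date, or None if the line has a different shape."""
    match = DATED_LINE.match(line)
    return match.group(1) if match else None

print(company_from_line('Dated: February 4, 2011 THE REAL COMPANY, INC'))
# THE REAL COMPANY, INC
print(company_from_line('Signed: February 4, 2011 THE REAL COMPANY, INC'))
# None
```

Because the date is matched explicitly, the captured group is not restricted to capital letters, so mixed-case company names would come through as well.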



 