
Regex to match a list of key words

I have a list of key words, each of which identifies a particular section of a document. There can be variations in how the key words are written. However, these key words blend in with the document text, and I know only a rudimentary way of finding them.

Some sample key words would be Assessment, Plan, Family History, Current Medications, Procedures, Allergies, etc.

Some sample text is here:

 Family History
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX      
 Social History
  · No alcohol use
 Current Meds
 Allergies
  · No Known Drug Allergies      
 Vitals
 Vital Signs [Data Includes: Current Encounter] 
 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX    
    Height     Tall 
    Weight     Well Built               
Physical Exam
Lorem Ipsum is simply dummy text of the printing and typesetting industry
Lorem Ipsum has been the industry's standard dummy text ever since the
1500s, when an unknown printer took a galley of type and scrambled it to    
Assessment
History of Medication
      None
Plan
It is a long established fact that a reader will be distracted by
readable content of a page when looking at its layout. The point of using
Lorem Ipsum is that it has a more-or-less normal distribution of letters,

This is what I have working so far

'.*\bPlan\b|.*\bHistory\b|.*\bMeds\b'

Is there a better way of finding a list of terms (case-insensitively) using a regex in Python?

What you have should be equivalent to

.*\b(Plan|History|Meds)\b

The leading .* is redundant: use search instead of match, which lets the regex be found anywhere in the string.
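As a minimal sketch of that suggestion, the snippet below uses re.search with the re.IGNORECASE flag to cover the case-insensitive part of the question. The sample lines and the names KEYWORDS, lines and hits are made up for illustration.

```python
import re

# Keywords from the question, compiled once and matched case-insensitively.
KEYWORDS = re.compile(r"\b(Plan|History|Meds)\b", re.IGNORECASE)

# Illustrative input lines (not the asker's real data).
lines = [
    " Family History",
    "  · No alcohol use",
    " current meds",
    "PLAN",
]

# search() scans the whole string, so no leading .* is needed.
hits = [line for line in lines if KEYWORDS.search(line)]
print(hits)
```

Only the bullet line is dropped, because it contains none of the three key words.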

However what you probably really want is to make sure that these words are the first 'real' thing to appear in the line, so I'd recommend:

\s*(Plan|...

to say that only whitespace should appear at the beginning, or

\W*(Plan|...

if you need more flexibility, e.g. to allow bullet points (\W matches any non-word character).
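One way to apply the \W* variant to a whole document is to build the alternation from a keyword list and anchor it at the start of each line with re.MULTILINE. This is only a sketch: the keyword list, the sample text and the names keywords, heading and found are assumptions chosen for illustration.

```python
import re

# Hypothetical keyword list; multi-word phrases go in unchanged.
keywords = ["Assessment", "Plan", "Family History", "Social History",
            "Current Meds", "Allergies", "Vitals", "Physical Exam"]

# Escape each phrase and try longer phrases first, so that a longer
# phrase wins over a shorter one that shares its prefix.
alternation = "|".join(
    re.escape(k) for k in sorted(keywords, key=len, reverse=True)
)

# ^ with MULTILINE anchors at each line start; \W* skips leading
# whitespace or bullet characters before the key word.
heading = re.compile(r"^\W*(" + alternation + r")\b",
                     re.IGNORECASE | re.MULTILINE)

text = """ Family History
 Social History
  · No alcohol use
 Allergies
  · No Known Drug Allergies
Assessment
History of Medication
Plan
It is a long established fact
"""

found = [m.group(1) for m in heading.finditer(text)]
print(found)
```

With this list, "History of Medication" is not treated as a heading, because "History" on its own is not one of the phrases. Whether that is the right behaviour depends on the real documents.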

Update for additional question in comment:

Here's an example of a regex that only matches up to 4 words:

^(\W*\w+\W*){0,4}\W*$

Test:

import re

# Strings of 1 to 5 words; only those with at most 4 words should match.
for i in range(1, 6):
    print(bool(re.match(r"^(\W*\w+\W*){0,4}\W*$", "abc " * i)))

This prints True four times, then False once.

I tried to do it with word boundaries but gave up. Honestly, you'd be better off counting the number of words with a simpler regex. Don't use regular expressions unless they really feel right for the task: plain code is generally more powerful and often much easier.
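A sketch of the counting approach, assuming a "word" means a run of word characters; the helper name is_short_line and its max_words parameter are hypothetical:

```python
import re

def is_short_line(line, max_words=4):
    """Return True if the line contains at most max_words words."""
    # findall returns every run of word characters; counting them
    # replaces the nested-quantifier regex with a simple comparison.
    return len(re.findall(r"\w+", line)) <= max_words

# Same inputs as the test above: 1 to 5 repetitions of "abc ".
results = [is_short_line("abc " * i) for i in range(1, 6)]
print(results)
```

This gives the same answers as the single-regex test on those inputs, and the word limit becomes an ordinary parameter.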
