How can I use a regex to search a PDF for all words within parentheses EXCEPT for a specific set of words?

I am trying to search an 8-page PDF file for all words within parentheses EXCEPT for "(EAI)", "(EY)", and a few others. Using a regex, I can pull every three-letter word within parentheses, but I don't know how to exclude the ones I don't want.

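import re

# `text` holds the text already extracted from the PDF
lines = text.split()
search = r"\(\D{3}\)"  # raw string; \D matches any non-digit character
regex = re.compile(search)

for line in lines:
    three_letters = regex.findall(line)
    for word in three_letters:
        print(word)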

I get the following list:

(FBS) (NFS) (IAD) (CDs) (CDs) (EAI) (EAI) (EAI) (VIG) (EAI) (EAI) (NTF) (DRP) (EAI) (IAD)

But I need a handful of them excluded.

I've been banging my head on this one for a while. Please help!

Use the findall function with this pattern (it matches exactly 3 uppercase letters):

\((?!(?:list|of|stuff|you|don't|want)\))[A-Z]{3}\)

Formatted

 \(
 (?!
      (?:
           list
        |  of
        |  stuff
        |  you
        |  don't
        |  want 
      )
      \)
 )
 [A-Z]{3} 
 \)
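
As a minimal sketch, the formatted layout above can be passed to Python's re module almost as-is by using the re.VERBOSE flag, which ignores whitespace and allows comments inside the pattern. The sample text and the excluded words (EAI and EY, taken from the question) are stand-ins; substitute your own extracted text and exclusion list.

```python
import re

# Hypothetical sample standing in for text extracted from the PDF.
text = "Funds (FBS) and (NFS) via (IAD); see (EAI), (EY) and (VIG)."

pattern = re.compile(
    r"""
    \(                  # literal opening parenthesis
    (?!                 # negative lookahead: fail if what follows is
        (?: EAI | EY )  #   one of the excluded words
        \)              #   immediately followed by the closing parenthesis
    )
    [A-Z]{3}            # exactly three uppercase letters
    \)                  # literal closing parenthesis
    """,
    re.VERBOSE,
)

matches = pattern.findall(text)
print(matches)  # ['(FBS)', '(NFS)', '(IAD)', '(VIG)']
```

The lookahead includes the closing parenthesis so that only whole words are excluded: a word that merely starts with an excluded prefix would still match.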

Specify a range to make the length variable.
This example matches 2 to 5 letters with {2,5}.
For 2 letters with no upper limit, use {2,}.

\((?!(?:list|of|stuff|you|don't|want)\))[A-Z]{2,5}\)
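
If the exclusion list changes often, one possible approach is to build the alternation from a Python list rather than typing it into the pattern by hand. In this sketch the `excluded` list and the sample text are hypothetical, and the character class is widened to [A-Za-z] on the assumption that mixed-case entries such as (CDs) from the question's output should still be captured.

```python
import re

# Hypothetical exclusion list; replace with the words you want to skip.
excluded = ["EAI", "EY", "IAD"]

# re.escape guards against any regex metacharacters in the words.
alternation = "|".join(re.escape(word) for word in excluded)

# Doubled braces in an f-string produce the literal {2,5} quantifier.
pattern = re.compile(rf"\((?!(?:{alternation})\))[A-Za-z]{{2,5}}\)")

# Hypothetical sample standing in for text extracted from the PDF.
text = "(FBS) (CDs) (EAI) (EY) (IAD) (DRP) (TOOLONG)"

kept = pattern.findall(text)
print(kept)  # ['(FBS)', '(CDs)', '(DRP)']
```

Here (TOOLONG) is dropped because it exceeds the five-letter upper bound, not because of the lookahead.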
