
Regex for selecting a specific group of part-of-speech patterns

Hey, I have been trying to write a regex that selects these POS (part-of-speech) tag patterns:

JJ JJ JJ JJ 
JJ JJ JJ NNS 
JJ JJ NN NN 
JJ JJ NN
JJ JJ NNS
JJ JJ RB
JJ JJ
JJ NN IN DT JJ
JJ NN JJ NNS
JJ NN JJ
JJ NN NN NN
JJ NN NN
JJ NN NNS
JJ NN
JJ NNP
JJ NNS IN NN
JJ NNS IN NN
JJ NNS NN
JJ NNS NNS
JJ NNS
JJ VBG NNS
JJ VBZ NNS
JJR NN

I have tried the regex below, but it doesn't seem to select everything. Can someone help me with this?

(((JJ|NN)\w?)+ ((NN\w?\s?)+|(JJ\s?)+|(RB\s?)+|(IN\s?)+|(DT\s?)+|(VB\w?\s?)+))

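You could simply build an expression from the list of patterns, joined with a | separator. You only need to make sure that longer patterns come before shorter ones, because the regex engine tries the alternatives from left to right and takes the first one that matches rather than the longest: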

patterns = """JJ JJ JJ JJ 
JJ JJ JJ NNS 
JJ JJ NN NN 
JJ JJ NN
JJ JJ NNS
JJ JJ RB
JJ JJ
JJ NN IN DT JJ
JJ NN JJ NNS
JJ NN JJ
JJ NN NN NN
JJ NN NN
JJ NN NNS
JJ NN
JJ NNP
JJ NNS IN NN
JJ NNS IN NN
JJ NNS NN
JJ NNS NNS
JJ NNS
JJ VBG NNS
JJ VBZ NNS
JJR NN"""

import re

# Sort longest-first so a short pattern never shadows a longer one that starts the same way.
pattern = "|".join(sorted(patterns.split("\n"), key=len, reverse=True))

results = re.findall(pattern, patterns)  # finds them all in 0.009 ms
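
To see why the ordering matters, here is a minimal sketch using just two of the patterns from the list. The variable names and the sample text are made up for illustration; only `re.findall` from the standard library is used.

```python
import re

text = "JJ NN NN"

# Shorter alternative first: the engine stops at the first alternative that matches.
short_first = re.findall("JJ NN|JJ NN NN", text)

# Longer alternative first: the full three-tag pattern is found.
long_first = re.findall("JJ NN NN|JJ NN", text)

print(short_first)  # ['JJ NN']
print(long_first)   # ['JJ NN NN']
```

Sorting the alternatives by length in descending order is one simple way to guarantee that a pattern is always tried before any of its own prefixes.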

The alternation approach is roughly 3 times faster than a single complex expression such as:

pattern = r'^((JJ(?:R)*\s*)+\s*((((NN(?:S|P)*|VB(?:G|Z)*|RB|JJ)\s*))+\s*)+(((IN|NN(?:S|P)*|DT|JJ)\s*))*)$'
results = re.findall(pattern, patterns)
# takes forever (possibly because of the newlines in the text: \s matches them too,
# so the nested quantifiers have many ways to backtrack).

# taking end of lines out of the equation:
singleLine = patterns.replace("\n", "*")
result     = re.findall(pattern, singleLine)
# takes 0.030 ms
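
One reason newlines can matter here is that `\s` matches `\n` as well as spaces, so an expression built on `\s` can run across line breaks. A minimal sketch of that behaviour, with a made-up two-line input:

```python
import re

sample = "JJ\nNN"  # two tags on separate lines (made-up input)

# \s matches any whitespace, including the newline, so this spans both lines.
across_lines = re.findall(r"JJ\s+NN", sample)

# A literal space does not match the newline, so nothing is found.
same_line_only = re.findall(r"JJ +NN", sample)

print(across_lines)    # ['JJ\nNN']
print(same_line_only)  # []
```

Using a literal space instead of `\s` is one way to keep each match on a single line.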

A single-regex solution, less efficient than @Alain's. It matches every pattern in the list, although it also accepts a few combinations that are not listed (for example JJR JJ):

^JJR? (?:JJ|NN[PS]?|VB[GZ]?)(?: (?:JJ|NNS?|RB|IN)(?: (?:JJ|NNS?|DT)(?: JJ)?)?)? *$
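
The expression can be checked against the list with a short script. This is only a sketch: the names `rx`, `wanted`, `missed`, `extras` and `accepted` are invented for the example, and the extra inputs are hypothetical probes rather than part of the question.

```python
import re

# The expression from this answer, split over two string literals for readability.
rx = re.compile(
    r"^JJR? (?:JJ|NN[PS]?|VB[GZ]?)"
    r"(?: (?:JJ|NNS?|RB|IN)(?: (?:JJ|NNS?|DT)(?: JJ)?)?)? *$"
)

# The patterns from the question (the duplicated line is listed once).
wanted = [
    "JJ JJ JJ JJ ", "JJ JJ JJ NNS ", "JJ JJ NN NN ", "JJ JJ NN", "JJ JJ NNS",
    "JJ JJ RB", "JJ JJ", "JJ NN IN DT JJ", "JJ NN JJ NNS", "JJ NN JJ",
    "JJ NN NN NN", "JJ NN NN", "JJ NN NNS", "JJ NN", "JJ NNP",
    "JJ NNS IN NN", "JJ NNS NN", "JJ NNS NNS", "JJ NNS", "JJ VBG NNS",
    "JJ VBZ NNS", "JJR NN",
]

# Listed patterns that the expression fails to match.
missed = [p for p in wanted if not rx.match(p)]
print(missed)  # []

# Hypothetical inputs that are not in the list, to probe how strict the expression is.
extras = ["NN JJ", "JJ", "JJR JJ"]
accepted = [p for p in extras if rx.match(p)]
print(accepted)  # ['JJR JJ']
```

If only the exact listed sequences should be accepted, the alternation built from the list is the stricter choice.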

