Creating a regular expression to find sequences of repeated tags in a tagged text

Question

I'm trying to write a regular expression that will find compound noun phrases, such as "weapons production facilities" or "EPA air quality regulation announcements," in a text that's been tagged with a part of speech tagger. I only want to find compound noun phrases that are 3 or more words long. So I scrape off the tags from the tagged text and then look for three or more noun tags in a row. Here's what I have:

import re

stringOfTags = 'DET NN NN NNS IN DET NN NN VBD JJ NNP NN NN NNS '

pattern = re.compile(r"(NN[SP]? ){3,}")
match = pattern.findall(stringOfTags)
for item in match:
    print(item)

And this is the output, which is not what I want at all:

NNS
NNS

Instead, I want it to find 'NN NN NNS' and 'NNP NN NN NNS' in stringOfTags. Can anyone help me create a regex that will find sequences of 3 or more noun tags in a row?

Answer 1

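When a pattern contains a capturing group, findall returns the text of that group instead of the whole match, and a repeated group keeps only its last repetition. Replace the capturing group ( ) with a non-capturing group (?: ) so that findall returns the whole match: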

pattern = re.compile(r"(?:NN[SP]? ){3,}")

Or keep the non-capturing group and wrap the whole repetition in a single capturing group, which gives the same result with findall:

pattern = re.compile(r"((?:NN[SP]? ){3,})")
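
To see the difference side by side, here is a small sketch that runs the question's pattern and the non-capturing version against the same tag string. The variable names (string_of_tags, capturing, non_capturing) are just illustrative, not from the original post.

```python
import re

# Same tag string as in the question.
string_of_tags = 'DET NN NN NNS IN DET NN NN VBD JJ NNP NN NN NNS '

# The question's pattern: one capturing group, repeated 3+ times.
capturing = re.compile(r"(NN[SP]? ){3,}")
# The fix: the same pattern with a non-capturing group.
non_capturing = re.compile(r"(?:NN[SP]? ){3,}")

# With a capturing group, findall returns the group's text,
# and a repeated group only remembers its final repetition.
last_repetitions = capturing.findall(string_of_tags)

# With no capturing group, findall returns each whole match.
whole_matches = non_capturing.findall(string_of_tags)

print(last_repetitions)  # ['NNS ', 'NNS ']
print(whole_matches)     # ['NN NN NNS ', 'NNP NN NN NNS ']
```

Note that each whole match keeps its trailing space, because the space is part of the repeated unit; call .strip() on the items if that matters.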

Final solution:

import re

stringOfTags = 'DET NN NN NNS IN DET NN NN VBD JJ NNP NN NN NNS '

pattern = re.compile(r"(?:NN[SP]? ){3,}")
match = pattern.findall(stringOfTags)

for item in match:
    print(item)

Output

NN NN NNS 
NNP NN NN NNS
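
As an alternative sketch, if you would rather leave the original capturing pattern untouched, finditer yields match objects, and group(0) is always the whole match regardless of any groups. The names below (string_of_tags, runs) are illustrative only.

```python
import re

string_of_tags = 'DET NN NN NNS IN DET NN NN VBD JJ NNP NN NN NNS '

# The question's original pattern, capturing group and all.
pattern = re.compile(r"(NN[SP]? ){3,}")

# group(0) is the full text matched by the pattern, so the
# capturing group no longer affects what we collect.
runs = [m.group(0).strip() for m in pattern.finditer(string_of_tags)]

print(runs)  # ['NN NN NNS', 'NNP NN NN NNS']
```

This also gives access to m.start() and m.end() if the position of each run in the tag string is needed.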

Answer 2

import re

stringOfTags = 'DET NN NN NNS IN DET NN NN VBD JJ NNP NN NN NNS '

pattern = re.compile(r"((?:NN[SP]? ){3,})")
match = pattern.findall(stringOfTags)
for item in match:
    print(item)

produces

NN NN NNS 
NNP NN NN NNS
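
Since the goal in the question is the noun phrases themselves rather than the tags, here is a rough sketch of mapping each matched tag run back to its words. It assumes the tagger output is available as a list of (word, tag) pairs and that every tag is followed by exactly one space in the tag string; the sample sentence and the names tagged, string_of_tags and phrases are made up for illustration.

```python
import re

# Hypothetical tagger output: a list of (word, tag) pairs.
tagged = [
    ('the', 'DET'), ('weapons', 'NNS'), ('production', 'NN'),
    ('facilities', 'NNS'), ('were', 'VBD'), ('closed', 'VBN'),
]

# Build the tag string with one trailing space per tag, as in the question.
string_of_tags = ''.join(tag + ' ' for _, tag in tagged)

pattern = re.compile(r"(?:NN[SP]? ){3,}")

phrases = []
for m in pattern.finditer(string_of_tags):
    # Each tag contributes exactly one space, so counting spaces
    # before the match gives the index of the first matched token.
    first = string_of_tags[:m.start()].count(' ')
    # The number of spaces inside the match is the number of tags in the run.
    length = m.group(0).count(' ')
    words = [word for word, _ in tagged[first:first + length]]
    phrases.append(' '.join(words))

print(phrases)  # ['weapons production facilities']
```

The space-counting trick only holds while tags themselves contain no spaces and the pattern cannot start in the middle of a tag, which is an assumption about the tagset rather than something the regex enforces.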

2 answers

solution1
1 2014-05-07 00:31:46

solution2
0 2014-05-07 00:25:01