
How do I apply a regex to my tagged text in Python 3?

I have a text. I tokenize it and remove stopwords, then tag the remaining words with a POS tagger through NLTK in Python. For now, I am using this code to tag the words and write them to a file.

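import nltk

# filtered_sentence is the list of tokens left after stopword removal.
tag = nltk.pos_tag(filtered_sentence)
print("tagging the words")
# Write one "word TAG" pair per line; the with block closes the file.
with open("Stop_Words.txt", "w") as fh:
    for word, pos in tag:
        fh.write(word + " " + pos + "\n")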

Now I get a list like this in my file:

paper NN
parallel NN
programming VBG
practical JJ
Greg NNP
Wilson NNP
intended VBD
scientist NN
interested JJ
... A big List ...

What I want to do now is apply a regex to this output to find particular cases. For example, I want something like (JJ*N+), meaning an adjective followed by any noun. I wrote N+ because NN, NNP, etc. are all noun tags.

How should I do this? I am clueless. Any help will be appreciated.

If you only want a single adjective (JJ) followed by a noun, you could do something like this:

import re

text = '''paper NN
parallel NN
programming VBG
practical JJ
Greg NNP
Wilson NNP
intended VBD
scientist NN
interested JJ
'''

pattern = re.compile(r'\w+? JJ\n\w+ NN.?', re.DOTALL)

result = pattern.findall(text)
print(result)

Output

['practical JJ\nGreg NNP']

Explanation

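The pattern '\w+? JJ\n\w+ NN.?' matches a run of word characters (\w+), followed by a space, then JJ, then a newline (\n), then another run of word characters, a space, and a tag that starts with NN. The trailing .? allows one optional extra character, so tags such as NNP and NNS match as well as NN; because re.DOTALL is set, that character can also be the newline after a plain NN. Note that I included both words in the match because I think that might be useful for your purposes.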
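If the two words are needed separately rather than as one matched string, the same idea can be written with capturing groups. The sketch below is only one way to do it: the group names adj and noun and the helper adjective_noun_pairs are made up for illustration, and it uses NN\w* instead of NN.? so that the tag can never swallow a newline.

```python
import re

text = '''paper NN
parallel NN
programming VBG
practical JJ
Greg NNP
Wilson NNP
intended VBD
scientist NN
interested JJ
'''

# Same shape as the pattern above, with named groups around the two words.
# NN\w* accepts NN as well as longer noun tags such as NNS, NNP and NNPS.
PAIR = re.compile(r'(?P<adj>\w+) JJ\n(?P<noun>\w+) NN\w*')


def adjective_noun_pairs(tagged_text):
    """Return (adjective, noun) tuples for each JJ line directly followed by a noun line."""
    return [(m.group('adj'), m.group('noun')) for m in PAIR.finditer(tagged_text)]


pairs = adjective_noun_pairs(text)
print(pairs)  # [('practical', 'Greg')]
```

Like the original pattern, this only sees words made of \w characters, so a hyphenated token would not match.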

UPDATE

If you want zero or more adjectives (JJ*) followed by one or more nouns (NN+), you could do something like this:

import re

text = '''paper NN
parallel NN
programming VBG
practical JJ
Greg NNP
Wilson NNP
intended VBD
scientist NN
interested JJ
'''

pattern = re.compile(r'(\w+? JJ\n)*(\w+ NN\w?)+', re.DOTALL)

result = pattern.finditer(text)
for element in result:
    print(element.group())
    print('----')

Output

paper NN
----
parallel NN
----
practical JJ
Greg NNP
----
Wilson NNP
----
scientist NN
----
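In the output above, consecutive nouns such as Greg and Wilson come out as separate matches. The noun group contains no newline, so its + cannot repeat across lines. If a run of nouns should stay together, one option is to make the line break part of each repeated group. The following is a sketch under that assumption, not a drop-in replacement: the names CHUNK and find_chunks are illustrative, and it assumes every line, including the last, ends with a newline, as in the file written by the code in the question.

```python
import re

text = '''paper NN
parallel NN
programming VBG
practical JJ
Greg NNP
Wilson NNP
intended VBD
scientist NN
interested JJ
'''

# Each repeated group consumes a whole line, including its trailing newline,
# so the quantifiers can span several consecutive lines.
# re.MULTILINE lets ^ match at the start of every line.
CHUNK = re.compile(r'(?:^\w+ JJ\n)*(?:^\w+ NN\w*\n)+', re.MULTILINE)


def find_chunks(tagged_text):
    """Return each run of JJ* NN+ lines as a list of (word, tag) tuples."""
    chunks = []
    for match in CHUNK.finditer(tagged_text):
        lines = match.group().splitlines()
        chunks.append([tuple(line.split(' ')) for line in lines])
    return chunks


for chunk in find_chunks(text):
    print(' '.join(word for word, tag in chunk))
# paper parallel
# practical Greg Wilson
# scientist
```

Returning (word, tag) tuples instead of raw strings makes it easy to filter or re-join the words afterwards.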
