
Spacy Regex Phrase Matcher in Python

In a large corpus of text, I am interested in extracting every sentence that contains any of a specific list of (verb, noun) or (adjective, noun) pairs. My list is long, but here is a sample. In my MWE I am trying to extract sentences with "write/wrote/writing/writes" and "book/s". I have around 30 such pairs of words.

Here is what I have tried, but it's not catching most of the sentences:

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

doc = nlp(u'Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.\
While writing this book, he had to fend off aliens and dinosaurs. Greene\'s second book might not have been written by him. \
Greene\'s cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around \
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.')

matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "write"}, {"TEXT": {"REGEX": ".+"}}, {"LEMMA": "book"}]
matcher.add("testy", None, pattern)

for sent in doc.sents:
    if matcher(nlp(sent.lemma_)):
        print(sent.text)

Unfortunately, I am only getting one match:

"While writing this book, he had to fend off aliens and dinosaurs."

I expected to get the "He wrote his first book" sentence as well. The other write-book sentences use "writer" as a noun, so it is fine that those are not matching.

The issue is that in the Matcher, by default each dictionary in the pattern corresponds to exactly one token. So your regex doesn't match across any number of characters; it is tested against the text of a single token, which isn't what you want.
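
To see this concretely, here is a minimal sketch (using two sentences from your doc) of how that pattern behaves:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)

# {"TEXT": {"REGEX": ".+"}} consumes exactly one token, so this pattern
# only matches "write <one token> book", never "write <two tokens> book".
matcher.add("ONE_GAP", [[{"LEMMA": "write"}, {"TEXT": {"REGEX": ".+"}}, {"LEMMA": "book"}]])

print(matcher(nlp("While writing this book, he had to fend off aliens and dinosaurs.")))  # one match: gap is one token ("this")
print(matcher(nlp("He wrote his first book when he was a hundred and fifty years old.")))  # no match: gap is two tokens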

To get what you want, you can use the OP key to specify that you want to match any number of tokens. See the operators and quantifiers section in the docs.
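
For reference, a quick sketch of the quantifier values an otherwise-empty token dict can take (only the "*" form is used below):

pattern_star = [{"LEMMA": "write"}, {"OP": "*"}, {"LEMMA": "book"}]  # zero or more tokens in between
pattern_plus = [{"LEMMA": "write"}, {"OP": "+"}, {"LEMMA": "book"}]  # one or more tokens in between
pattern_opt  = [{"LEMMA": "write"}, {"OP": "?"}, {"LEMMA": "book"}]  # zero or one token in between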

However, given your problem, you probably want to use the DependencyMatcher instead, so I rewrote your code to show that as well. Try this:

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

doc = nlp("""
Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.
While writing this book, he had to fend off aliens and dinosaurs. Greene's second book might not have been written by him. 
Greene's cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around 
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.""")

matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "write"},{"OP": "*"},{"LEMMA": "book"}]
matcher.add("testy", [pattern])

print("----- Using Matcher -----")
for sent in doc.sents:
    if matcher(sent):
        print(sent.text)

print("----- Using Dependency Matcher -----")

deppattern = [
        {"RIGHT_ID": "wrote", "RIGHT_ATTRS": {"LEMMA": "write"}},
        {"LEFT_ID": "wrote", "REL_OP": ">", "RIGHT_ID": "book", 
            "RIGHT_ATTRS": {"LEMMA": "book"}}
        ]

from spacy.matcher import DependencyMatcher

dmatcher = DependencyMatcher(nlp.vocab)

dmatcher.add("BOOK", [deppattern])

# The DependencyMatcher returns one token index per node in the pattern,
# in the order the nodes were declared ("wrote" first, then "book")
for match_id, token_ids in dmatcher(doc):
    print(doc[token_ids[0]].sent)
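
Since you have around 30 such pairs, you can add one dependency pattern per pair in a loop. A sketch, continuing from the code above, with a made-up pair list:

pairs = [("write", "book"), ("read", "article"), ("sign", "contract")]  # hypothetical sample of your ~30 pairs

pair_matcher = DependencyMatcher(nlp.vocab)
for verb, noun in pairs:
    pattern = [
        {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": verb}},
        # ">" requires the noun to be a direct child of the verb;
        # ">>" would allow any descendant
        {"LEFT_ID": "verb", "REL_OP": ">", "RIGHT_ID": "noun",
         "RIGHT_ATTRS": {"LEMMA": noun}},
    ]
    pair_matcher.add(f"{verb}_{noun}", [pattern])

for match_id, token_ids in pair_matcher(doc):
    print(nlp.vocab.strings[match_id], "->", doc[token_ids[0]].sent.text.strip())

For your (adjective, noun) pairs you would anchor on the noun instead and look for the adjective among its children, since adjectives attach to the nouns they modify in the dependency tree.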

One other, less important thing: the way you were calling the matcher was a bit weird. You can pass the matcher a Doc or a Span, but it should be natural text. Calling .lemma_ on the sentence and creating a fresh doc from that happened to work in your case, but in general it should be avoided.
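
That is:

matcher(nlp(sent.lemma_))  # what the question did: re-parses the lemma string as if it were new text
matcher(sent)              # preferred: pass the Span of natural text directly; LEMMA patterns handle inflection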
