Spacy Regex Phrase Matcher in Python

In a large corpus of text, I am interested in extracting every sentence that contains a specific (verb-noun) or (adjective-noun) pair from a list. I have a long list, but here is a sample. In my MWE I am trying to extract sentences containing "write/wrote/writing/writes" and "book/s". I have around 30 such pairs of words.

Here is what I have tried, but it doesn't catch most of the sentences:

import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

doc = nlp(u'Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.\
While writing this book, he had to fend off aliens and dinosaurs. Greene\'s second book might not have been written by him. \
Greene\'s cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around \
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.')

matcher = Matcher(nlp.vocab)
pattern = [{"LEMMA": "write"}, {"TEXT": {"REGEX": ".+"}}, {"LEMMA": "book"}]
matcher.add("testy", None, pattern)

for sent in doc.sents:
    if matcher(nlp(sent.lemma_)):
        print(sent.text)

Unfortunately, I only get one match:

"While writing this book, he had to fend off aliens and dinosaurs."

However, I would also like to get the "He wrote his first book" sentence. The other write-book instances have "writer" as a noun, and it's a bonus that those don't match.

The problem is that in the Matcher, by default, each dictionary in the pattern corresponds to exactly one token. So your regex isn't matching any number of characters, it's matching any one token, which isn't what you want.

To get what you want, you can use the OP value to specify that you want to match any number of tokens. See the operators or quantifiers section in the docs.
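To see the effect of the quantifier in isolation, here is a minimal sketch using a blank English pipeline (no trained model, so it matches on lowercase text rather than lemmas; the pattern name `WROTE_BOOK` is just for illustration):

```python
import spacy
from spacy.matcher import Matcher

# A blank pipeline is enough to demonstrate OP; only the tokenizer is used.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# {"OP": "*"} lets zero or more arbitrary tokens sit between the two anchors,
# unlike a single {"TEXT": {"REGEX": ".+"}} dict, which consumes exactly one token.
pattern = [{"LOWER": "wrote"}, {"OP": "*"}, {"LOWER": "book"}]
matcher.add("WROTE_BOOK", [pattern])

doc = nlp("He wrote his first book last year.")
matches = matcher(doc)
for match_id, start, end in matches:
    print(doc[start:end].text)  # the span from "wrote" through "book"
```

With the one-token regex dict instead, this sentence would not match at all, because two tokens ("his first") sit between the anchors.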

However, given your problem, you might want to actually use the Dependency Matcher, so I rewrote your code to use it as well. Try this:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

doc = nlp("""
Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old.
While writing this book, he had to fend off aliens and dinosaurs. Greene's second book might not have been written by him. 
Greene's cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around 
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.""")

matcher = Matcher(nlp.vocab)
# "OP": "*" matches any number of tokens between "write" and "book"
pattern = [{"LEMMA": "write"}, {"OP": "*"}, {"LEMMA": "book"}]
matcher.add("testy", [pattern])

print("----- Using Matcher -----")
for sent in doc.sents:
    if matcher(sent):
        print(sent.text)

print("----- Using Dependency Matcher -----")

deppattern = [
        {"RIGHT_ID": "wrote", "RIGHT_ATTRS": {"LEMMA": "write"}},
        {"LEFT_ID": "wrote", "REL_OP": ">", "RIGHT_ID": "book", 
            "RIGHT_ATTRS": {"LEMMA": "book"}}
        ]

from spacy.matcher import DependencyMatcher

dmatcher = DependencyMatcher(nlp.vocab)

dmatcher.add("BOOK", [deppattern])

for _, (start, end) in dmatcher(doc):
    print(doc[start].sent)

One other, less important thing - the way you're calling the matcher is a bit odd. You can pass the matcher Docs or Spans, but they should definitely be natural text, so calling .lemma_ on a sentence and creating a new doc from it happens to work in your case, but should generally be avoided.
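The point above can be sketched with a blank pipeline plus a rule-based sentencizer (assumed here so that `doc.sents` works without a trained model; with a blank pipeline there is no lemmatizer, so the pattern uses lowercase text instead of lemmas):

```python
import spacy
from spacy.matcher import Matcher

# Blank pipeline with a rule-based sentencizer for sentence boundaries.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

matcher = Matcher(nlp.vocab)
matcher.add("WROTE_BOOK", [[{"LOWER": "wrote"}, {"OP": "*"}, {"LOWER": "book"}]])

doc = nlp("He wrote his first book. The cat was the real writer.")

# Pass each sentence Span straight to the matcher, instead of re-parsing
# its lemmatized text with nlp(sent.lemma_).
hits = [sent.text for sent in doc.sents if matcher(sent)]
print(hits)
```

Matching on the Span keeps the original token attributes (lemmas, POS tags, parse) intact, which re-tokenizing a string of joined lemmas would throw away.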
