
How to add noun phrases that I already know when using noun_chunks in spacy (or np_extractor in textblob)?

I'm using noun_chunks in spacy and np_extractor in textblob to find all phrases in some articles. There are some technical terms that get parsed wrong. For example, for "ANOVA is also called analysis of variance" the result shows that the noun phrases are "ANOVA", "analysis", "variance", but I think the correct noun phrases are "ANOVA", "analysis of variance". I already have a phrase list containing some technical phrases and I think it could help the parsing. How can I use this list to retrain or improve the noun phrase extractor?

This sounds like a good use case for rule-based matching. It's especially powerful in a scenario like yours, where you get to combine the statistical models (e.g. noun chunks based on the part-of-speech tags and dependencies) with your own rules to cover the remaining specific cases.

Here's a simple example:

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
# This is just an example – see the docs for how to do this more elegantly
# Note: this is the spaCy v2 signature; in v3, pass the patterns as a list:
# matcher.add("PHRASES", [nlp("ANOVA"), nlp("analysis of variance")])
matcher.add("PHRASES", None, nlp("ANOVA"), nlp("analysis of variance"))

doc = nlp("A text about analysis of variance or ANOVA")
matches = matcher(doc)
for match_id, start, end in matches:
    matched_span = doc[start:end]
    print(matched_span.text)
# analysis of variance
# ANOVA

The start and end index of each match let you create a span – so you'll end up with Span objects, just like the ones returned by doc.noun_chunks. If you want to solve this even more elegantly, you could also add a custom attribute like doc._.custom_noun_chunks that runs the matcher on the Doc and returns the matched spans, or even the matched spans plus the original noun chunks.
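Here's a minimal sketch of that custom-attribute idea. The name custom_noun_chunks comes from the suggestion above; the getter function name is my own choice. It uses spacy.blank("en") so the snippet runs without a trained model (with en_core_web_sm loaded instead, the doc.has_annotation("DEP") branch would also pull in the parser-based noun chunks), and the spaCy v3 matcher.add signature:

```python
import spacy
from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher
from spacy.util import filter_spans

# Blank pipeline: tokenizer only, so the example runs without a trained model
nlp = spacy.blank("en")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("PHRASES", [nlp("ANOVA"), nlp("analysis of variance")])

def get_custom_noun_chunks(doc):
    """Return matched phrase spans, plus doc.noun_chunks if a parse exists."""
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    if doc.has_annotation("DEP"):  # noun_chunks needs the dependency parse
        spans += list(doc.noun_chunks)
    # Drop overlapping spans, keeping the longest; result is sorted by start
    return filter_spans(spans)

Doc.set_extension("custom_noun_chunks", getter=get_custom_noun_chunks)

doc = nlp("A text about analysis of variance or ANOVA")
print([span.text for span in doc._.custom_noun_chunks])
# ['analysis of variance', 'ANOVA']
```

filter_spans is handy here because a term from your phrase list may overlap with a model-produced noun chunk, and you usually want to keep only the longest span covering each token.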

Btw, the doc.noun_chunks are based on the part-of-speech tags and the dependency parse. You can check out the code for how they're computed for English here. While you could theoretically improve the noun chunks by fine-tuning the tagger and parser, this approach seems like overkill and much more speculative for your use case. If you already have the phrase list, you might as well match it directly.
