
Finding unusual phrases using a “bag of usual phrases”

My goal is to input an array of phrases as in

array = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.","At vero eos et accusam et justo duo dolores et ea rebum.","Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]

and to present a new phrase to it, like

"Felix qui potuit rerum cognoscere causas"

and I want it to tell me whether this phrase is likely to belong to the group of phrases in the aforementioned array or not.

I have found how to detect word frequencies, but how do I measure dissimilarity? After all, my goal is to find unusual phrases, not the frequency of certain words.

You can build a simple "language model" for this purpose. It estimates the probability of a phrase and marks phrases with a low average per-word probability as unusual.

To estimate word probabilities, it can use smoothed word counts.
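Concretely, this is additive (add-delta) smoothing: the probability of a word w is estimated as

P(w) = (count(w) + delta) / (N + delta * V)

where N is the total number of word tokens in the training corpus and V is the vocabulary size. This is exactly what the perplexity method below computes for each word.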

This is what the model could look like:

import re
import numpy as np
from collections import Counter

class LanguageModel:
    """ A simple model to measure 'unusualness' of sentences. 
    delta is a smoothing parameter. 
    The larger delta is, the higher is the penalty for unseen words.
    """
    def __init__(self, delta=0.01):
        self.delta = delta
    def preprocess(self, sentence):
        words = sentence.lower().split()
        return [re.sub(r"[^A-Za-z]+", '', word) for word in words]
    def fit(self, corpus):
        """ Estimate counts from an array of texts """
        self.counter_ = Counter(word 
                                 for sentence in corpus 
                                 for word in self.preprocess(sentence))
        self.total_count_ = sum(self.counter_.values())  # total number of word tokens
        self.vocabulary_size_ = len(self.counter_)  # number of distinct words
    def perplexity(self, sentence):
        """ Calculate negative mean log probability of a word in a sentence 
        The higher this number, the more unusual the sentence is.
        """
        words = self.preprocess(sentence)
        mean_log_proba = 0.0
        for word in words:
            # use a smoothed version of "probability" to work with unseen words
            word_count = self.counter_.get(word, 0) + self.delta
            total_count = self.total_count_ + self.vocabulary_size_ * self.delta
            word_probability = word_count / total_count
            mean_log_proba += np.log(word_probability) / len(words)
        return -mean_log_proba

    def relative_perplexity(self, sentence):
        """ Perplexity, normalized between 0 (the most usual sentence) and 1 (the most unusual)"""
        return (self.perplexity(sentence) - self.min_perplexity) / (self.max_perplexity - self.min_perplexity)

    @property
    def max_perplexity(self):
        """ Perplexity of an unseen word """
        return -np.log(self.delta / (self.total_count_ + self.vocabulary_size_ * self.delta))

    @property
    def min_perplexity(self):
        """ Perplexity of the most likely word """
        return self.perplexity(self.counter_.most_common(1)[0][0])

You can train this model and apply it to different sentences.

train = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
                 "At vero eos et accusam et justo duo dolores et ea rebum.",
                 "Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]
test = ["Felix qui potuit rerum cognoscere causas", # an "unlikely" phrase
        'sed diam nonumy eirmod sanctus sit amet', # a "likely" phrase
       ]

lm = LanguageModel()
lm.fit(train)

for sent in test:
    print(lm.perplexity(sent).round(3), sent)

which prints:

8.525 Felix qui potuit rerum cognoscere causas
3.517 sed diam nonumy eirmod sanctus sit amet

You can see that the "unusualness" is higher for the first phrase than for the second, because the second one is built entirely from words in the training corpus.
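The relative_perplexity method defined above rescales the same score to the range 0 (the most usual sentence) to 1 (the most unusual), which can be easier to threshold:

for sent in test:
    print(lm.relative_perplexity(sent).round(3), sent)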

If your corpus of "usual" phrases is large enough, you can switch from the 1-gram model I use here to N-gram models (for English, a sensible N is 2 or 3). Alternatively, you can use recurrent neural networks to predict the probability of each word conditioned on all the previous words, but this requires a really large training corpus.
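For reference, here is a minimal sketch of what a bigram (2-gram) version of the same idea could look like; the class name BigramLanguageModel and the '<s>' start-of-sentence marker are illustrative choices, not part of the answer above:

import re
import numpy as np
from collections import Counter

class BigramLanguageModel:
    """ A rough bigram extension of the same idea: score each word by its
    smoothed probability given the previous word. """
    def __init__(self, delta=0.01):
        self.delta = delta
    def preprocess(self, sentence):
        words = sentence.lower().split()
        # '<s>' marks the sentence start, so the first real word also has a context
        return ['<s>'] + [re.sub(r"[^A-Za-z]+", '', word) for word in words]
    def fit(self, corpus):
        self.unigrams_ = Counter()
        self.bigrams_ = Counter()
        for sentence in corpus:
            words = self.preprocess(sentence)
            self.unigrams_.update(words)
            self.bigrams_.update(zip(words, words[1:]))
        self.vocabulary_size_ = len(self.unigrams_)
    def perplexity(self, sentence):
        """ Negative mean log probability of each word given its predecessor """
        words = self.preprocess(sentence)
        log_proba = 0.0
        for prev, word in zip(words, words[1:]):
            pair_count = self.bigrams_.get((prev, word), 0) + self.delta
            context_count = self.unigrams_.get(prev, 0) + self.delta * self.vocabulary_size_
            log_proba += np.log(pair_count / context_count)
        return -log_proba / (len(words) - 1)

It is trained and applied exactly like the unigram version: lm2 = BigramLanguageModel(); lm2.fit(train); lm2.perplexity(sentence).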

If you work with a highly inflected language, like Turkish, you can use character-level N-grams instead of a word-level model, or preprocess your texts with a lemmatization algorithm from NLTK.
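As a rough illustration, preprocessing with NLTK's WordNetLemmatizer could look like this (lemmatize treats every word as a noun unless you pass a POS tag, so this is only a first approximation):

import re
from nltk.stem import WordNetLemmatizer  # needs the WordNet data: nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def preprocess(sentence):
    """ Drop-in replacement for LanguageModel.preprocess that maps
    inflected forms (e.g. 'causes') to a single lemma ('cause') """
    words = [re.sub(r"[^A-Za-z]+", '', w) for w in sentence.lower().split()]
    return [lemmatizer.lemmatize(w) for w in words if w]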

For finding common phrases in a sentence, you can use Gensim's Phrases (collocation) detection.
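A minimal sketch, reusing the train list from the answer above (the min_count and threshold values are set artificially low only because the toy corpus is tiny):

from gensim.models.phrases import Phrases

# one list of tokens per training text
tokenized = [sentence.lower().replace(',', '').replace('.', '').split()
             for sentence in train]

phrases = Phrases(tokenized, min_count=1, threshold=1)

# detected collocations come back joined with '_', e.g. 'sed_diam'
print(phrases[['sed', 'diam', 'nonumy', 'eirmod']])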

But if you want to detect unusual phrases, you could instead describe some part-of-speech combination patterns with regular expressions; by POS-tagging the input sentence, you can then extract unseen words (phrases) that match your patterns.
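A rough sketch of that idea with NLTK's pos_tag (the train list comes from the answer above, the NN/JJ filter stands in for whatever part-of-speech pattern you actually define, and the punkt and averaged_perceptron_tagger data must be downloaded):

import nltk

known_words = set(word for sentence in train for word in sentence.lower().split())

sentence = "Felix qui potuit rerum cognoscere causas"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# keep noun/adjective tokens that never occurred in the "usual" corpus
unseen = [(word, tag) for word, tag in tagged
          if tag.startswith(('NN', 'JJ')) and word.lower() not in known_words]
print(unseen)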
