My goal is to input an array of phrases as in
array = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.","At vero eos et accusam et justo duo dolores et ea rebum.","Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]
and to present a new phrase to it, like
"Felix qui potuit rerum cognoscere causas"
and I want it to tell me whether this phrase is likely to belong to the group in that array or not.
I found how to count word frequencies, but how do I measure dissimilarity? After all, my goal is to find unusual phrases, not the frequency of certain words.
You can build a simple "language model" for this purpose: it will estimate the probability of a phrase and mark phrases with a low average per-word probability as unusual.
For word probability estimation, it can use smoothed word counts.
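Concretely, with add-delta smoothing the estimated probability of a word w is (count(w) + delta) / (total_count + vocabulary_size * delta), so unseen words still get a small nonzero probability instead of zero.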
This is what the model could look like:
import re
import numpy as np
from collections import Counter


class LanguageModel:
    """ A simple model to measure the 'unusualness' of sentences.
    delta is a smoothing parameter:
    the larger delta is, the higher the penalty for unseen words.
    """
    def __init__(self, delta=0.01):
        self.delta = delta

    def preprocess(self, sentence):
        words = sentence.lower().split()
        return [re.sub(r"[^A-Za-z]+", '', word) for word in words]

    def fit(self, corpus):
        """ Estimate word counts from an array of texts """
        self.counter_ = Counter(word
                                for sentence in corpus
                                for word in self.preprocess(sentence))
        self.total_count_ = sum(self.counter_.values())
        self.vocabulary_size_ = len(self.counter_)

    def perplexity(self, sentence):
        """ Calculate the negative mean log probability of the words in a sentence.
        The higher this number, the more unusual the sentence is.
        """
        words = self.preprocess(sentence)
        mean_log_proba = 0.0
        for word in words:
            # use a smoothed "probability" so that unseen words get a small nonzero value
            word_count = self.counter_.get(word, 0) + self.delta
            total_count = self.total_count_ + self.vocabulary_size_ * self.delta
            word_probability = word_count / total_count
            mean_log_proba += np.log(word_probability) / len(words)
        return -mean_log_proba

    def relative_perplexity(self, sentence):
        """ Perplexity normalized between 0 (the most usual sentence) and 1 (the most unusual) """
        return (self.perplexity(sentence) - self.min_perplexity) / (self.max_perplexity - self.min_perplexity)

    @property
    def max_perplexity(self):
        """ Perplexity of a single unseen word """
        return -np.log(self.delta / (self.total_count_ + self.vocabulary_size_ * self.delta))

    @property
    def min_perplexity(self):
        """ Perplexity of the most frequent word """
        return self.perplexity(self.counter_.most_common(1)[0][0])
You can train this model and apply it to different sentences.
train = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
"At vero eos et accusam et justo duo dolores et ea rebum.",
"Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]
test = ["Felix qui potuit rerum cognoscere causas", # an "unlikely" phrase
'sed diam nonumy eirmod sanctus sit amet', # a "likely" phrase
]
lm = LanguageModel()
lm.fit(train)
for sent in test:
print(lm.perplexity(sent).round(3), sent)
which prints
8.525 Felix qui potuit rerum cognoscere causas
3.517 sed diam nonumy eirmod sanctus sit amet
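If you want a yes/no answer to the original question, you can normalize the score with relative_perplexity and pick a cutoff (the 0.5 below is an arbitrary assumption you would tune on your data):

for sent in test:
    score = lm.relative_perplexity(sent)
    # flag the sentence as unusual if its normalized score is above the cutoff
    print(round(score, 3), 'unusual' if score > 0.5 else 'usual', sent)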
You can see that the "unusualness" is higher for the first phrase than for the second, because the second one is composed entirely of words from the training corpus.
If your corpus of "usual" phrases is large enough, you can switch from the 1-gram model used here to N-grams (for English, a sensible N is 2 or 3). Alternatively, you can use recurrent neural networks to predict the probability of each word conditioned on all the previous words, but this requires a really large training corpus.
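For illustration, here is a minimal sketch of a bigram variant that reuses the LanguageModel above by overriding preprocess (the BigramLanguageModel name is made up for this example):

class BigramLanguageModel(LanguageModel):
    def preprocess(self, sentence):
        words = super().preprocess(sentence)
        # treat each pair of adjacent words as a single token
        return [' '.join(pair) for pair in zip(words, words[1:])]

bigram_lm = BigramLanguageModel()
bigram_lm.fit(train)
for sent in test:
    print(bigram_lm.perplexity(sent).round(3), sent)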
If you work with a highly inflected language, like Turkish, you can use character-level N-grams instead of a word-level model, or just preprocess your texts using a lemmatization algorithm from NLTK.
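A rough sketch of the character-level idea, again reusing the LanguageModel above (the trigram length is an arbitrary choice, and the class name is illustrative):

class CharTrigramLanguageModel(LanguageModel):
    def preprocess(self, sentence):
        # keep only letters and spaces, then slide a 3-character window over the text
        text = re.sub(r"[^a-z ]+", '', sentence.lower())
        return [text[i:i + 3] for i in range(len(text) - 2)]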
For finding common phrases in a sentence you can use Gensim's Phrases (collocation) detection.
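For example, a minimal sketch with Gensim (the min_count and threshold values here are arbitrary and would need tuning on your data):

from gensim.models.phrases import Phrases

tokenized = [sentence.lower().split() for sentence in train]
phrases = Phrases(tokenized, min_count=1, threshold=0.1)
# collocations found in the corpus are joined with '_' when you transform a token list
print(phrases[['sed', 'diam', 'nonumy', 'eirmod']])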
But if you want to detect unusual phrases, you could describe some part-of-speech combination patterns (for example with regular expressions), run POS tagging on the input sentence, and extract the unseen words or phrases that match your patterns.
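A rough sketch of that idea with NLTK (assumes the punkt and averaged_perceptron_tagger data are downloaded; the adjective+noun pattern is just one example):

import nltk

sentence = "The quick brown fox jumps over the lazy dog"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
# collect adjective+noun pairs as candidate phrases
candidates = [(w1, w2)
              for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
              if t1.startswith('JJ') and t2.startswith('NN')]
print(candidates)

You could then keep only the candidates whose words never occur in your training corpus.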