
Having trouble finding ngram documentation that works

I'm trying to use the ngram function in Python and having trouble implementing it correctly for a problem I'm working on

I've tried plugging in ngram and ngrams

import re
import nltk
from nltk.util import ngrams

def n_grams(words, min=2, max=3):
    s = []
    for n in range(min, max):
        for ngram in ngrams(words, n):
            s.append(' '.join(str(i) for i in ngram))
    return s

t = 'hippopotomonstrosesquippedaliophobia'
t_split = re.split(r'\W*', t)
print(n_grams(t_split))

I'm trying to return the following:

#{'tr', 'ho', 'hi', 'to', 'om', 'io', 'ob', 'mo', 'ed', 'ip', 'al', 'bi', 'pe', 
#'da', 'po', 'ns', 'qu', 'st', 'ia', 'ot', 'se', 'op', 'ro', 'ui', 'li', 'pp', 
#'es', 'sq', 'ph', 'on', 'os'} 

but instead returning this:
#[' h', 'h i', 'i p', 'p p', 'p o', 'o p', 'p o', 'o t', 't o', 'o m', 'm o', 
#'o n', 'n s', 's t', 't r', 'r o', 'o s', 's e', 'e s', 's q', 'q u', 'u i', 
#'i p', 'p p', 'p e', 'e d', 'd a', 'a l', 'l i', 'i o', 'o p', 'p h', 'h o', 
#'o b', 'b i', 'i a', 'a ']

Really, the only issues here are the superfluous regex and the join syntax. You're calling re.split() with a pattern that matches zero or more non-word characters ([^a-zA-Z0-9_]*), but your string contains no non-word characters, so every match the pattern finds is empty. What happens next depends on your Python version: through 3.6, re.split() skipped empty matches (with a FutureWarning), returning the whole word unchanged; from 3.7 on, it splits on empty matches too, turning your word into a list of single characters with an empty string at each end. Those empty strings, combined with the ' '.join (see below), are why your output starts with ' h' and ends with 'a '.
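You can see the empty-match splitting directly (the commented output is what Python 3.7+ produces):

```python
import re

t = 'hippopotomonstrosesquippedaliophobia'
parts = re.split(r'\W*', t)   # every match of \W* here is empty
print(parts[:4], parts[-2:])  # on 3.7+: ['', 'h', 'i', 'p'] ['a', '']
```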

If I use the n_grams function as you've written it, but join with an empty string (''.join) instead of ' '.join, and delete your regex entirely, I think it gets what you want (the set of bigrams):

print(set(n_grams(t)))

Which is:

{'es', 'op', 'bi', 'hi', 'ot', 'ro', 'ph', 'al', 
 'ns', 'sq', 'ho', 'ed', 'ob', 'ip', 'to', 'io', 
 'on', 'da', 'pe', 'om', 'mo', 'ia', 'st', 'po', 
 'tr', 'qu', 'se', 'ui', 'pp', 'li', 'os'}
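For what it's worth, the nltk import isn't doing much here; a stdlib-only sketch of the fixed function (empty-string join, no regex) yields the same windows that nltk.util.ngrams produces over a string -- the names below are just illustrative:

```python
def n_grams(text, min_n=2, max_n=3):
    # Slide a window of width n across the string for each n in
    # [min_n, max_n), mirroring nltk.util.ngrams(text, n), and join
    # each window with no separator.
    grams = []
    for n in range(min_n, max_n):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

t = 'hippopotomonstrosesquippedaliophobia'
print(len(set(n_grams(t))))  # 31 distinct bigrams
```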

If you choose to from collections import Counter, then you can also get this:

print(Counter(n_grams(t)))

yielding a count dictionary, essentially:

Counter({'ip': 2, 'pp': 2, 'po': 2, 'op': 2, 'hi': 1, 'ot': 1, 'to': 1, 'om': 
  1, 'mo': 1, 'on': 1, 'ns': 1, 'st': 1, 'tr': 1, 'ro': 1, 'os': 1, 'se': 1, 
  'es': 1, 'sq': 1, 'qu': 1, 'ui': 1, 'pe': 1, 'ed': 1, 'da': 1, 'al': 1, 'li': 
  1, 'io': 1, 'ph': 1, 'ho': 1, 'ob': 1, 'bi': 1, 'ia': 1})

To handle edge characters, you can tell NLTK's ngrams function to use left and right padding and specify the pad symbols (conventionally "<s>" and "</s>"), but that doesn't seem to be necessary in this example.
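The padding just prepends and appends n-1 pad symbols before the windows are taken; a minimal stdlib sketch of that behavior (function name and defaults are illustrative, mirroring nltk.util.ngrams with pad_left=True and pad_right=True):

```python
from itertools import chain

def padded_ngrams(seq, n, left_pad='<s>', right_pad='</s>'):
    # Add n-1 pad symbols on each side, then slide a window of width n,
    # as nltk.util.ngrams does when padding is enabled.
    padded = list(chain([left_pad] * (n - 1), seq, [right_pad] * (n - 1)))
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

print(padded_ngrams('hi', 2))
# [('<s>', 'h'), ('h', 'i'), ('i', '</s>')]
```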
