
Having trouble finding ngram documentation that works

I'm trying to use the ngram function in Python and having trouble implementing it correctly for a problem I'm working on

I've tried plugging in ngram and ngrams

import re
import nltk
from nltk.util import ngrams

def n_grams(words, min=2, max=3):
    s = []
    for n in range(min, max):
        for ngram in ngrams(words, n):
            s.append(' '.join(str(i) for i in ngram))
    return s

t = 'hippopotomonstrosesquippedaliophobia'
t_split = re.split(r'\W*', t)
print(n_grams(t_split))

I'm trying to return the following:

#{'tr', 'ho', 'hi', 'to', 'om', 'io', 'ob', 'mo', 'ed', 'ip', 'al', 'bi', 'pe', 
#'da', 'po', 'ns', 'qu', 'st', 'ia', 'ot', 'se', 'op', 'ro', 'ui', 'li', 'pp', 
#'es', 'sq', 'ph', 'on', 'os'} 

but instead returning this:
#[' h', 'h i', 'i p', 'p p', 'p o', 'o p', 'p o', 'o t', 't o', 'o m', 'm o', 
#'o n', 'n s', 's t', 't r', 'r o', 'o s', 's e', 'e s', 's q', 'q u', 'u i', 
#'i p', 'p p', 'p e', 'e d', 'd a', 'a l', 'l i', 'i o', 'o p', 'p h', 'h o', 
#'o b', 'b i', 'i a', 'a ']

Really, the only issues here are the superfluous regex and the join syntax. You're calling re.split() with a pattern that matches zero or more non-word characters ([^a-zA-Z0-9_]*), but your string contains no non-word characters, so every match the pattern finds is empty. What happens next depends on your Python version: through 3.6, re.split() skipped empty matches (with a FutureWarning), returning the whole word unchanged; from 3.7 on, it splits on empty matches too, turning your word into a list of single characters with an empty string at each end. Those empty strings, combined with the ' '.join (see below), are why your output starts with ' h' and ends with 'a '.
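You can see the empty-match splitting directly (the commented output is what Python 3.7+ produces):

```python
import re

t = 'hippopotomonstrosesquippedaliophobia'
parts = re.split(r'\W*', t)   # every match of \W* here is empty
print(parts[:4], parts[-2:])  # on 3.7+: ['', 'h', 'i', 'p'] ['a', '']
```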

If I use the n_grams function as you've written it, but join with an empty string (''.join) instead of ' '.join, and delete your regex entirely, I think it gets what you want (the set of bigrams):

print(set(n_grams(t)))

Which is:

{'es', 'op', 'bi', 'hi', 'ot', 'ro', 'ph', 'al', 
 'ns', 'sq', 'ho', 'ed', 'ob', 'ip', 'to', 'io', 
 'on', 'da', 'pe', 'om', 'mo', 'ia', 'st', 'po', 
 'tr', 'qu', 'se', 'ui', 'pp', 'li', 'os'}
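For what it's worth, the nltk import isn't doing much here; a stdlib-only sketch of the fixed function (empty-string join, no regex) yields the same windows that nltk.util.ngrams produces over a string -- the names below are just illustrative:

```python
def n_grams(text, min_n=2, max_n=3):
    # Slide a window of width n across the string for each n in
    # [min_n, max_n), mirroring nltk.util.ngrams(text, n), and join
    # each window with no separator.
    grams = []
    for n in range(min_n, max_n):
        for i in range(len(text) - n + 1):
            grams.append(text[i:i + n])
    return grams

t = 'hippopotomonstrosesquippedaliophobia'
print(len(set(n_grams(t))))  # 31 distinct bigrams
```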

If you choose to from collections import Counter, then you can also get this:

print(Counter(n_grams(t)))

yielding a count dictionary, essentially:

Counter({'ip': 2, 'pp': 2, 'po': 2, 'op': 2, 'hi': 1, 'ot': 1, 'to': 1, 'om': 
  1, 'mo': 1, 'on': 1, 'ns': 1, 'st': 1, 'tr': 1, 'ro': 1, 'os': 1, 'se': 1, 
  'es': 1, 'sq': 1, 'qu': 1, 'ui': 1, 'pe': 1, 'ed': 1, 'da': 1, 'al': 1, 'li': 
  1, 'io': 1, 'ph': 1, 'ho': 1, 'ob': 1, 'bi': 1, 'ia': 1})

To handle edge characters, you can tell NLTK's ngrams function to use left and right padding and specify the pad symbols (conventionally "<s>" and "</s>"), but that doesn't seem to be necessary in this example.
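The padding just prepends and appends n-1 pad symbols before the windows are taken; a minimal stdlib sketch of that behavior (function name and defaults are illustrative, mirroring nltk.util.ngrams with pad_left=True and pad_right=True):

```python
from itertools import chain

def padded_ngrams(seq, n, left_pad='<s>', right_pad='</s>'):
    # Add n-1 pad symbols on each side, then slide a window of width n,
    # as nltk.util.ngrams does when padding is enabled.
    padded = list(chain([left_pad] * (n - 1), seq, [right_pad] * (n - 1)))
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

print(padded_ngrams('hi', 2))
# [('<s>', 'h'), ('h', 'i'), ('i', '</s>')]
```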
