简体   繁体   中英

N-grams for letter in sklearn

I want to do n-grams method but letter by letter

Normal N-grams:

sentence : He want to watch football match

result:
he, he want, want, want to , to , to watch , watch , watch football , football, football match, match

I want to do this but letter by letter:

word : Angela 

result:
a, an, n , ng , g , ge, e ,el, l , la ,a

This is my code using Sklearn , but it is still word-by-word not letter-by-letter:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(ngram_range=(1, 100),token_pattern = r"(?u)\b\w+\b")

corpus = ['Angel','Angelica','John','Johnson']

X = vectorizer.fit_transform(corpus)
analyze = vectorizer.build_analyzer()
print(vectorizer.get_feature_names())
print(vectorizer.transform(['Angela']).toarray())

There is an 'analyzer' param which does what you want.

According to the documentation :-

analyzer : string, {'word', 'char', 'char_wb'} or callable

Whether the feature should be made of word or character n-grams. Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.

If a callable is passed it is used to extract the sequence of features out of the raw, unprocessed input.

By default, it is set to word, which you can change.

Just do:

vectorizer = CountVectorizer(ngram_range=(1, 100),
                             token_pattern = r"(?u)\b\w+\b", 
                             analyzer='char')

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM