
CoreNLP API for N-grams?

Does CoreNLP have an API for getting unigrams, bigrams, trigrams, etc.?

For example, I have the string "I have the best car". I would love to get:

I
I have
the
the best
car

based on the string I am passing.

If you are coding in Java, check out the getNgrams* functions in the StringUtils class in CoreNLP.

You can also use CollectionUtils.getNgrams (which is what the StringUtils class uses too).

You can use CoreNLP to tokenize, but for grabbing n-grams, do it natively in whatever language you're working in. If, say, you're piping this into Python, you can use list slicing and some list comprehensions to split them up:

>>> tokens
['I', 'have', 'the', 'best', 'car']
>>> unigrams = [tokens[i:i+1] for i in range(len(tokens))]
>>> bigrams = [tokens[i:i+2] for i in range(len(tokens) - 1)]
>>> trigrams = [tokens[i:i+3] for i in range(len(tokens) - 2)]
>>> unigrams
[['I'], ['have'], ['the'], ['best'], ['car']]
>>> bigrams
[['I', 'have'], ['have', 'the'], ['the', 'best'], ['best', 'car']]
>>> trigrams
[['I', 'have', 'the'], ['have', 'the', 'best'], ['the', 'best', 'car']]
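The same slicing idea can be wrapped in a function so the n-gram size is a parameter. This is only a minimal sketch: the names ngrams and ngrams_in_range are made up for illustration and are not part of CoreNLP or any other library.

```python
def ngrams(tokens, n):
    """Return every contiguous run of n tokens, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # range() is empty when n > len(tokens), so short inputs yield [].
    return [tokens[i:i + n] for i in range(len(tokens) - n + 1)]


def ngrams_in_range(tokens, min_size, max_size):
    """Collect n-grams for every size from min_size to max_size inclusive."""
    result = []
    for n in range(min_size, max_size + 1):
        result.extend(ngrams(tokens, n))
    return result


tokens = ['I', 'have', 'the', 'best', 'car']

# Join each n-gram back into a string if you want "I have" rather than a list.
bigram_strings = [' '.join(gram) for gram in ngrams(tokens, 2)]
print(bigram_strings)
```

Calling ngrams_in_range(tokens, 1, 3) would give the unigrams, bigrams, and trigrams in a single list, which is probably closest to what the question asks for.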

CoreNLP is great for heavy NLP lifting such as dependency parsing, coreference resolution, and POS tagging. It seems like overkill if all you want is tokenization, though, like bringing a fire truck to a water gun fight. Something like TreeTagger might meet your tokenization needs equally well.
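If the input is as simple as the example sentence, even a regular expression from the standard library may be enough to get tokens. The sketch below is a deliberately naive assumption about what counts as a token (the function name simple_tokenize is made up here), and it is not a substitute for what CoreNLP or TreeTagger do:

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    # \w+ matches a whole word; [^\w\s] matches one character that is
    # neither a word character nor whitespace (i.e. punctuation).
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("I have the best car ")
print(tokens)
```

It will split contractions such as "don't" into three pieces and knows nothing about abbreviations, so a real tokenizer is still the better choice for anything beyond clean text.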
