CoreNLP API for N-grams?

Question

Does CoreNLP have an API for getting unigrams, bigrams, trigrams, etc.?

For example, I have the string "I have the best car". I would love to get:

I
I have
the
the best
car

based on the string I am passing.

Answer 1

If you are coding in Java, check out the getNgrams* functions in CoreNLP's StringUtils class.

You can also use CollectionUtils.getNgrams, which is what the StringUtils class uses internally.
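If you are not on the JVM, the general idea of an n-gram helper that takes a minimum and maximum size can be sketched in a few lines of Python. This is not CoreNLP's code: the function name `ngrams_in_range` and its parameters are hypothetical, and the ordering of the result is just what this sketch produces, not a claim about what CoreNLP returns.

```python
def ngrams_in_range(tokens, min_size, max_size):
    """Return all contiguous n-grams with min_size <= n <= max_size.

    Hypothetical helper for illustration; each n-gram is a space-joined string.
    """
    result = []
    for n in range(min_size, max_size + 1):
        # A window of length n can start at indices 0 .. len(tokens) - n.
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


tokens = "I have the best car".split()
print(ngrams_in_range(tokens, 1, 2))
```

If `max_size` is larger than the number of tokens, the inner loop simply yields nothing for those sizes.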

Answer 2

You can use CoreNLP to tokenize, but for grabbing n-grams, do it natively in whatever language you're working in. If, say, you're piping this into Python, you can use list slicing and some list comprehensions to split them up:

>>> tokens = ['I', 'have', 'the', 'best', 'car']
>>> unigrams = [tokens[i:i+1] for i in range(len(tokens))]
>>> bigrams = [tokens[i:i+2] for i in range(len(tokens) - 1)]
>>> trigrams = [tokens[i:i+3] for i in range(len(tokens) - 2)]
>>> unigrams
[['I'], ['have'], ['the'], ['best'], ['car']]
>>> bigrams
[['I', 'have'], ['have', 'the'], ['the', 'best'], ['best', 'car']]
>>> trigrams
[['I', 'have', 'the'], ['have', 'the', 'best'], ['the', 'best', 'car']]
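The three comprehensions above differ only in the window size, so they can be folded into one function. Below is a minimal sketch; the name `ngrams` is my own, and it returns tuples rather than lists because it is built on `zip` over shifted copies of the token list.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as a list of tuples."""
    # tokens[0:], tokens[1:], ..., tokens[n-1:] are the same list shifted by
    # one position each; zip stops at the shortest one, so only complete
    # windows of length n are produced.
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ['I', 'have', 'the', 'best', 'car']
for size in (1, 2, 3):
    print(size, ngrams(tokens, size))
```

When `n` is larger than the number of tokens, the result is an empty list.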

CoreNLP is great for NLP heavy lifting such as dependency parsing, coreference resolution, and POS tagging. It seems like overkill if you only want to tokenize, though, like bringing a fire truck to a water gun fight. Something like TreeTagger might serve your tokenization needs equally well.

2 answers

solution1
2 ACCEPTED 2015-04-28 15:48:18

solution2
1 2015-04-27 21:54:44
