How to combine n-grams into one vocabulary in Spark?

I'm wondering whether Spark has a built-in feature to combine 1-gram, 2-gram, ..., n-gram features into a single vocabulary. Setting n=2 in NGram and then running CountVectorizer produces a dictionary containing only 2-grams. What I really want is to combine all frequent 1-grams, 2-grams, etc. into one dictionary for my corpus.

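You can train separate NGram and CountVectorizer models and merge their outputs using VectorAssembler. Each CountVectorizer learns the vocabulary for one n-gram order, and the assembler concatenates the resulting count vectors into a single feature vector.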

from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler
from pyspark.ml import Pipeline

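def build_ngrams(inputCol="tokens", n=3):
    # One NGram transformer per order: 1-grams, 2-grams, ..., n-grams.
    ngrams = [
        NGram(n=i, inputCol=inputCol, outputCol="{0}_grams".format(i))
        for i in range(1, n + 1)
    ]

    # One CountVectorizer per order, each learning its own vocabulary.
    vectorizers = [
        CountVectorizer(inputCol="{0}_grams".format(i),
                        outputCol="{0}_counts".format(i))
        for i in range(1, n + 1)
    ]

    # Concatenate the per-order count vectors into one feature vector.
    # The output column name "features" is an assumed choice.
    assembler = [VectorAssembler(
        inputCols=["{0}_counts".format(i) for i in range(1, n + 1)],
        outputCol="features"
    )]

    return Pipeline(stages=ngrams + vectorizers + assembler)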

Example usage:

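df = spark.createDataFrame([
  (1, ["a", "b", "c", "d"]),
  (2, ["d", "e", "d"])
], ("id", "tokens"))

# Fit the pipeline on the corpus and append the n-gram and count columns.
build_ngrams().fit(df).transform(df)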
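
If you also need the single dictionary itself, and not just the assembled vector, one option is to concatenate the per-order vocabularies in the same order the assembler receives its input columns. The sketch below is plain Python and only illustrates the index bookkeeping. The helper name merge_vocabularies and the sample vocabularies are hypothetical. With a fitted pipeline you would presumably read each list from the vocabulary attribute of the corresponding CountVectorizerModel stage.

```python
def merge_vocabularies(vocabularies):
    """Concatenate per-order vocabularies into one list.

    Position j in the result corresponds to index j of the assembled
    vector, assuming the vocabularies are passed in the same order as
    the assembler's input columns (1-grams first, then 2-grams, ...).
    """
    merged = []
    for vocab in vocabularies:
        merged.extend(vocab)
    return merged


# Hypothetical vocabularies, standing in for what each fitted
# CountVectorizerModel would expose through its `vocabulary` attribute.
# Assumption: with the stage order ngrams + vectorizers + assembler,
# the fitted vectorizers would be model.stages[n:2 * n].
unigram_vocab = ["d", "a", "b"]
bigram_vocab = ["a b", "d e"]

combined = merge_vocabularies([unigram_vocab, bigram_vocab])
print(combined)
```

Because the assembler simply places one count vector after another, the offset of each order's block is the total size of the vocabularies before it.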


