Apache Spark ML Pipeline: filter empty rows in dataset

In my Spark ML Pipeline (Spark 2.3.0) I use RegexTokenizer like this:

val regexTokenizer = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("words")
      .setMinTokenLength(3)

It transforms the DataFrame into one with an array of words per row, for example:

text      | words
-------------------------
a the     | [the]
a of to   | []
big small | [big,small]

How can I filter out the rows with empty [] arrays? Should I create a custom transformer and pass it to the pipeline?

You can use SQLTransformer:

import org.apache.spark.ml.feature.SQLTransformer

val emptyRemover = new SQLTransformer().setStatement(
  "SELECT * FROM __THIS__ WHERE size(words) > 0"
)

which can be applied directly:

val df = Seq(
  ("a the", Seq("the")), ("a of the", Seq()), 
  ("big small", Seq("big", "small"))
).toDF("text", "words")

emptyRemover.transform(df).show
+---------+------------+
|     text|       words|
+---------+------------+
|    a the|       [the]|
|big small|[big, small]|
+---------+------------+

or used in a Pipeline.
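As a minimal sketch of the Pipeline variant, the tokenizer from the question and the SQLTransformer above can be chained as two stages. The object name `EmptyRowPipelineSketch`, the helper `buildPipeline`, and the local SparkSession settings are illustrative assumptions, not part of the original answer:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{RegexTokenizer, SQLTransformer}
import org.apache.spark.sql.SparkSession

// Hypothetical wrapper object, used only to keep the sketch self-contained.
object EmptyRowPipelineSketch {

  // Tokenize "text" into "words", then drop rows whose "words" array is empty.
  def buildPipeline(): Pipeline = {
    val regexTokenizer = new RegexTokenizer()
      .setInputCol("text")
      .setOutputCol("words")
      .setMinTokenLength(3)

    val emptyRemover = new SQLTransformer()
      .setStatement("SELECT * FROM __THIS__ WHERE size(words) > 0")

    new Pipeline().setStages(Array(regexTokenizer, emptyRemover))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration; in spark-shell a session already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("empty-row-pipeline-sketch")
      .getOrCreate()
    import spark.implicits._

    val raw = Seq("a the", "a of to", "big small").toDF("text")

    // Both stages are plain transformers, so fit() has nothing to learn here.
    val cleaned = buildPipeline().fit(raw).transform(raw)
    cleaned.show() // the "a of to" row is filtered out

    spark.stop()
  }
}
```

Keep in mind that a filtering stage changes the number of rows, so it also drops rows at prediction time when the fitted pipeline is reused.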

Nonetheless, I would think twice before using this in a Spark ML process. Tools normally used downstream, like CountVectorizer, can handle empty input just fine:

import org.apache.spark.ml.feature.CountVectorizer

val vectorizer = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")

vectorizer.fit(df).transform(df).show
+---------+------------+-------------------+
|     text|       words|           features|
+---------+------------+-------------------+
|    a the|       [the]|      (3,[2],[1.0])|
| a of the|          []|          (3,[],[])|
|big small|[big, small]|(3,[0,1],[1.0,1.0])|
+---------+------------+-------------------+

and the absence of certain words can often provide useful information.

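import org.apache.spark.sql.functions.size

// Requires spark.implicits._ for the $"..." column syntax
df
  .select($"text", $"words")
  .where(size($"words") > 0)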
