
Spark (Scala): How to turn an Array[Row] into either a Dataset[Row] or a DataFrame?

I have an Array[Row] and I want to turn it into either a Dataset[Row] or a DataFrame.

How did I come up with an Array of Rows?

Well, I was trying to clear nulls from my dataset:

  • without having to filter EACH column (I have a lot), and
  • without using the .na.drop() function from DataFrameNaFunctions, because it fails to detect when a cell actually holds the string "null".

So, I came up with the following line to filter out null in all columns.

val outDF = inputDF.columns.flatMap { col => inputDF.filter(col + "!='' AND " + col + "!='null'").collect() }

The problem is that outDF is an Array[Row], hence the question! Any ideas welcome!

This is what your code would do if it worked:

inputDF.columns.map {
  col => inputDF.filter((inputDF(col) =!= "") and (inputDF(col) =!= "null"))
}.reduce(_ union _)

and something like this:

import org.apache.spark.sql.functions.lit

inputDF.where(inputDF.columns.map {
  col => (inputDF(col) =!= "") and (inputDF(col) =!= "null")
}.foldLeft(lit(true))(_ and _))

is what you want.

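Note that the first solution creates non-exclusive subsets: a row is kept once for every column that passes the filter, so it can appear more than once and rows that fail in only some columns still get through. With data like this: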

val inputDF = Seq(("1", "a"), ("2", ""), ("null", "")).toDF

the result would be:

+---+---+
| _1| _2|
+---+---+
|  1|  a|
|  2|   |
|  1|  a|
+---+---+

For the solution I believe to be correct:

+---+---+
| _1| _2|
+---+---+
|  1|  a|
+---+---+
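
To check the two approaches side by side, here is a minimal self-contained sketch. It assumes a local SparkSession; the object name NullStringFilter and the helper names allColumnsValid and unionOfColumnFilters are made up for illustration and do not come from the original answer.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

// Hypothetical wrapper object; the names here are illustrative only.
object NullStringFilter {
  // One predicate that holds only when NO column is "" or the string "null".
  def allColumnsValid(df: DataFrame): Column =
    df.columns
      .map(c => (df(c) =!= "") and (df(c) =!= "null"))
      .foldLeft(lit(true))(_ and _)

  // Per-column filters unioned together: a row is kept once per passing column.
  def unionOfColumnFilters(df: DataFrame): DataFrame =
    df.columns
      .map(c => df.filter((df(c) =!= "") and (df(c) =!= "null")))
      .reduce(_ union _)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for this demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("null-string-filter")
      .getOrCreate()
    import spark.implicits._

    val inputDF = Seq(("1", "a"), ("2", ""), ("null", "")).toDF()

    unionOfColumnFilters(inputDF).show()            // three rows, ("1", "a") twice
    inputDF.where(allColumnsValid(inputDF)).show()  // only ("1", "a")

    spark.stop()
  }
}
```

Because a comparison against a real SQL NULL evaluates to NULL and where keeps only rows whose predicate is true, the single combined predicate should also drop rows containing actual nulls, so no separate na.drop() call is needed with it.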


df.na.drop(df.columns).where("'null' not in ("+df.columns.mkString(",")+")")

This was answered using the following code, based on Mr Srinivas's comment:

//First drop all typical nulls
val prelimDF = inputDF.na.drop()

//Then drop all rows in which any column holds the string 'null'
val finalDF = prelimDF.na.drop(prelimDF.columns).where("'null' not in ("+prelimDF.columns.mkString(",")+")")
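
For the literal question in the title, an Array[Row] that is already in hand can be turned back into a DataFrame by pairing it with a schema. The following is a minimal sketch, assuming the rows were collected from a DataFrame whose schema is still available; the object name RowsToDataFrame and the helper fromRows are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.StructType

// Hypothetical helper object; not part of the original answers.
object RowsToDataFrame {
  // Rebuild a DataFrame from collected rows. The schema must match the rows' layout.
  def fromRows(spark: SparkSession, rows: Array[Row], schema: StructType): DataFrame =
    spark.createDataFrame(spark.sparkContext.parallelize(rows.toSeq), schema)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is acceptable for this demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rows-to-dataframe")
      .getOrCreate()
    import spark.implicits._

    val inputDF = Seq(("1", "a"), ("2", ""), ("null", "")).toDF()
    val rows: Array[Row] = inputDF.collect()

    // Reuse the schema of the DataFrame the rows came from.
    val rebuilt = fromRows(spark, rows, inputDF.schema)
    rebuilt.show()

    spark.stop()
  }
}
```

Collecting to the driver and then re-parallelizing moves every row through a single machine, so filtering directly on the DataFrame, as in the answers above, is usually the better choice for large data.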

