
Getting values of Fields of a Row of DataFrame - Spark Scala

I have a DataFrame that contains several records:

[Image: sample of the DataFrame's data]

I want to iterate over each row of this DataFrame in order to validate the data in each of its columns, doing something like the following code:

import org.apache.spark.sql.Row

val validDF = dfNextRows.map {
    x => ValidateRow(x)
}

def ValidateRow(row: Row): Boolean = {
    val nC = row.getString(0)
    val si = row.getString(1)
    val iD = row.getString(2)
    val iH = row.getString(3)
    val sF = row.getString(4)

    // Stuff to validate the data field of each row
    validateNC(nC)
    validateSI(si)
    validateID(iD)
    validateIH(iH)
    validateSF(sF)
    true
}

But, doing some tests, if I try to print the value of the val nC (to be sure that I'm sending the correct information to each function), it doesn't print anything:

def ValidateRow(row: Row): Boolean = {
    val nC = row.getString(0)
    val si = row.getString(1)
    val iD = row.getString(2)
    val iH = row.getString(3)
    val sF = row.getString(4)

    println(nC)

    validateNC(nC)
    validateSI(si)
    validateID(iD)
    validateIH(iH)
    validateSF(sF)
    true
}

[Image: console output showing that nothing is printed]

How can I know that I'm sending the correct information to each function (that I'm reading the data of each column of the row correctly)?

Regards.

Spark's DataFrame functions should give you a good start.

If your validation functions are simple enough (like checking for null values), then you can embed them directly as

dfNextRows.withColumn("num_cta", when(col("num_cta").isNotNull, col("num_cta")).otherwise(lit(0)))

You can do the same for the other columns just by using the appropriate Spark DataFrame functions.
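For example, a minimal sketch extending the pattern to a second column (the "si" column name is only an assumption for illustration):

import org.apache.spark.sql.functions.{col, lit, when}

val cleanedDF = dfNextRows
  .withColumn("num_cta", when(col("num_cta").isNotNull, col("num_cta")).otherwise(lit(0)))
  .withColumn("si",      when(col("si").isNotNull, col("si")).otherwise(lit("")))   // "si" is a hypothetical column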

If your validation rules are complex, then you can use udf functions as

import org.apache.spark.sql.functions.udf

val validateNC = udf((num_cta: Long) => {
   // define your rules here; the body must return a value
   if (num_cta > 0) num_cta else 0L   // placeholder rule, only illustrative
})

You can call the udf function using withColumn as

dfNextRows.withColumn("num_cta", validateNC(col("num_cta")))

You can do the same for the rest of your validation rules.
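For example, a minimal sketch reusing the udf above and chaining a second, hypothetical one (the validateSI udf and the "si" column name are assumptions, not from the original post):

import org.apache.spark.sql.functions.{col, udf}

// hypothetical rule for a second column: replace nulls with an empty string
val validateSI = udf((si: String) => Option(si).getOrElse(""))

val validatedDF = dfNextRows
  .withColumn("num_cta", validateNC(col("num_cta")))
  .withColumn("si",      validateSI(col("si")))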

I hope to see your problem get resolved soon.

map is a transformation; you need to apply an action. For instance, you could do dfNextRows.map(x => ValidateRow(x)).first. Spark operates lazily, much like the Stream class in the standard collections.
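A minimal sketch of forcing evaluation with an action (this assumes a SparkSession named spark; also note that a println inside ValidateRow runs on the executors, so on a cluster its output lands in the executor logs rather than the driver console):

import spark.implicits._   // supplies the Encoder[Boolean] that map on a DataFrame needs

val validDF = dfNextRows.map(x => ValidateRow(x))   // transformation: nothing runs yet
validDF.show()                                      // action: triggers the job, so ValidateRow (and its println) executes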
