
Getting values of Fields of a Row of DataFrame - Spark Scala

I have a DataFrame that contains several records:

[Image: sample of the DataFrame's data]

I want to iterate over each row of this DataFrame in order to validate the data in each of its columns, doing something like the following code:

import org.apache.spark.sql.Row

val validDF = dfNextRows.map {
    x => ValidateRow(x)
}

def ValidateRow(row: Row): Boolean = {
    val nC = row.getString(0)
    val si = row.getString(1)
    val iD = row.getString(2)
    val iH = row.getString(3)
    val sF = row.getString(4)

    // Stuff to validate the data field of each row
    validateNC(nC)
    validateSI(si)
    validateID(iD)
    validateIH(iH)
    validateSF(sF)
    true
}

But, doing some tests, if I try to print the value of the val nC (to be sure that I'm sending the correct information to each function), it doesn't print anything:

def ValidateRow(row: Row): Boolean = {
    val nC = row.getString(0)
    val si = row.getString(1)
    val iD = row.getString(2)
    val iH = row.getString(3)
    val sF = row.getString(4)

    println(nC)

    validateNC(nC)
    validateSI(si)
    validateID(iD)
    validateIH(iH)
    validateSF(sF)
    true
}

[Image: console output showing that nothing is printed]

How can I know that I'm sending the correct information to each function (that I'm reading the data of each column of the row correctly)?

Regards.

Spark's DataFrame functions should give you a good start.

If your validation functions are simple enough (like checking for null values), then you can embed them directly as

dfNextRows.withColumn("num_cta", when(col("num_cta").isNotNull, col("num_cta")).otherwise(lit(0)))

You can do the same for the other columns just by using the appropriate Spark DataFrame functions.
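For example, a minimal sketch extending the pattern to a second column (the "si" column name is only an assumption for illustration):

import org.apache.spark.sql.functions.{col, lit, when}

val cleanedDF = dfNextRows
  .withColumn("num_cta", when(col("num_cta").isNotNull, col("num_cta")).otherwise(lit(0)))
  .withColumn("si",      when(col("si").isNotNull, col("si")).otherwise(lit("")))   // "si" is a hypothetical column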

If your validation rules are complex, then you can use udf functions as

import org.apache.spark.sql.functions.udf

val validateNC = udf((num_cta: Long) => {
   // define your rules here; the body must return a value
   if (num_cta > 0) num_cta else 0L   // placeholder rule, only illustrative
})

You can call the udf function using withColumn as

dfNextRows.withColumn("num_cta", validateNC(col("num_cta")))

You can do the same for the rest of your validation rules.
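For example, a minimal sketch reusing the udf above and chaining a second, hypothetical one (the validateSI udf and the "si" column name are assumptions, not from the original post):

import org.apache.spark.sql.functions.{col, udf}

// hypothetical rule for a second column: replace nulls with an empty string
val validateSI = udf((si: String) => Option(si).getOrElse(""))

val validatedDF = dfNextRows
  .withColumn("num_cta", validateNC(col("num_cta")))
  .withColumn("si",      validateSI(col("si")))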

I hope to see your problem get resolved soon.

map is a transformation; you need to apply an action. For instance, you could do dfNextRows.map(x => ValidateRow(x)).first. Spark operates lazily, much like the Stream class in the standard collections.
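A minimal sketch of forcing evaluation with an action (this assumes a SparkSession named spark; also note that a println inside ValidateRow runs on the executors, so on a cluster its output lands in the executor logs rather than the driver console):

import spark.implicits._   // supplies the Encoder[Boolean] that map on a DataFrame needs

val validDF = dfNextRows.map(x => ValidateRow(x))   // transformation: nothing runs yet
validDF.show()                                      // action: triggers the job, so ValidateRow (and its println) executes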
