
Getting values of Fields of a Row of DataFrame - Spark Scala

I have a DataFrame which contains several records:

[Image: the DataFrame's contents]

I want to iterate over each row of this DataFrame in order to validate the data in each of its columns, doing something like the following code:

val validDF = dfNextRows.map {
    x => ValidateRow(x)
}

def ValidateRow(row: Row): Boolean = {
    val nC = row.getString(0)
    val si = row.getString(1)
    val iD = row.getString(2)
    val iH = row.getString(3)
    val sF = row.getString(4)

    // Stuff to validate the data field of each row
    validateNC(nC)
    validateSI(si)
    validateID(iD)
    validateIF(iH)
    validateSF(sF)
    true
}

But, doing some tests, if I try to print the value of val nC (to be sure that I'm sending the correct information to each function), nothing is printed:

def ValidateRow(row: Row): Boolean = {
    val nC = row.getString(0)
    val si = row.getString(1)
    val iD = row.getString(2)
    val iH = row.getString(3)
    val sF = row.getString(4)

    println(nC)

    validateNC(nC)
    validateSI(si)
    validateID(iD)
    validateIF(iH)
    validateSF(sF)
    true
}


How can I know that I'm sending the correct information to each function (that I'm reading the data of each column of the row correctly)?

Regards.

Spark DataFrame functions should give you a good start.

If your validate functions are simple enough (like checking for null values), then you can embed the functions as

dfNextRows.withColumn("num_cta", when(col("num_cta").isNotNull, col("num_cta")).otherwise(lit(0)))

You can do the same for the other columns in the same manner, just by using appropriate Spark DataFrame functions.
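For example, a minimal sketch chaining several such null-replacement rules (only num_cta appears in this answer; the si and id column names are hypothetical stand-ins for the question's real schema):

import org.apache.spark.sql.functions.{col, lit, when}

// replace nulls with a per-column default value
val cleanedDF = dfNextRows
  .withColumn("num_cta", when(col("num_cta").isNotNull, col("num_cta")).otherwise(lit(0)))
  .withColumn("si", when(col("si").isNotNull, col("si")).otherwise(lit("")))  // hypothetical column
  .withColumn("id", when(col("id").isNotNull, col("id")).otherwise(lit("")))  // hypothetical column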

If your validation rules are complex, then you can use udf functions as

def validateNC = udf((num_cta: Long) => {
   // define your rules here, e.g. (hypothetical) require a positive account number
   num_cta > 0L
})

You can call the udf function using withColumn as

dfNextRows.withColumn("num_cta", validateNC(col("num_cta")))

You can do the same for the rest of your validation rules.
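A minimal sketch of chaining such udf-based rules into Boolean flag columns (the validateSI rule and the si column name are hypothetical):

import org.apache.spark.sql.functions.{col, udf}

// one udf per column; each returns a Boolean validity flag
val validateNC = udf((num_cta: Long) => num_cta > 0L)            // hypothetical rule
val validateSI = udf((si: String) => si != null && si.nonEmpty)  // hypothetical rule

val validatedDF = dfNextRows
  .withColumn("num_cta_valid", validateNC(col("num_cta")))
  .withColumn("si_valid", validateSI(col("si")))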

I hope your problem gets resolved soon.

map is a transformation; you need to apply an action. For instance, you could do dfNextRows.map(x => ValidateRow(x)).first. Spark operates lazily, much like the Stream class in the standard collections.
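A minimal sketch of forcing evaluation (assuming a SparkSession named spark is in scope; note that on a cluster, the println inside ValidateRow runs on the executors, so its output shows up in the executor logs rather than the driver console):

import spark.implicits._  // supplies the Encoder[Boolean] that map needs

// map alone only builds a lazy Dataset[Boolean]; nothing executes yet
val validDF = dfNextRows.map(row => ValidateRow(row))

// an action such as show, collect, or count triggers the computation,
// and with it the println inside ValidateRow
validDF.show()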
