Getting values of Fields of a Row of a DataFrame - Spark Scala
I have a DataFrame that contains several records, and I want to iterate over each row of this DataFrame in order to validate the data in each of its columns, doing something like the following code:
val validDF = dfNextRows.map {
x => ValidateRow(x)
}
def ValidateRow(row: Row): Boolean = {
  val nC = row.getString(0)
  val si = row.getString(1)
  val iD = row.getString(2)
  val iH = row.getString(3)
  val sF = row.getString(4)
  // Stuff to validate the data field of each row
  validateNC(nC)
  validateSI(si)
  validateID(iD)
  validateIH(iH)
  validateSF(sF)
  true
}
But, doing some tests, when I try to print the value of the val nC (to be sure that I'm sending the correct information to each function), it doesn't print anything:
def ValidateRow(row: Row): Boolean = {
  val nC = row.getString(0)
  val si = row.getString(1)
  val iD = row.getString(2)
  val iH = row.getString(3)
  val sF = row.getString(4)
  println(nC)
  validateNC(nC)
  validateSI(si)
  validateID(iD)
  validateIH(iH)
  validateSF(sF)
  true
}
How can I know that I'm sending the correct information to each function (that I'm reading the data of each column of the row correctly)?
Regards.
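As an aside, the per-field validate functions themselves are plain Scala and can be exercised without Spark at all. A minimal sketch with hypothetical rules (the real bodies of validateNC, validateSI, etc. are not shown in the question, so these rules are made up purely for illustration):

```scala
// Hypothetical stand-ins for the validators used in ValidateRow.
// The actual rules are not shown in the question; these only
// illustrate that each validator is an ordinary String => Boolean.
def validateNC(nC: String): Boolean = nC != null && nC.nonEmpty
def validateSI(si: String): Boolean = si != null && si.forall(_.isDigit)

// They can be exercised directly, before wiring them into a Row:
val ncOk = validateNC("ABC123")  // non-empty string: valid
val siOk = validateSI("12x")     // contains a non-digit: invalid
```

Testing the validators in isolation like this is the quickest way to confirm you are passing the right column values to each one.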
The built-in Spark DataFrame functions should give you a good start.
If your validate functions are simple enough (like checking for null values), then you can embed them as column expressions:
dfNextRows.withColumn("num_cta", when(col("num_cta").isNotNull, col("num_cta")).otherwise(lit(0)))
You can do the same for the other columns in the same manner, just by using the appropriate Spark DataFrame functions.
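For intuition, that when/otherwise expression is the DataFrame analogue of a plain null-to-default substitution. A minimal plain-Scala sketch of the same idea (no Spark required; the function name is made up):

```scala
// Plain-Scala analogue of:
//   when(col("num_cta").isNotNull, col("num_cta")).otherwise(lit(0))
// i.e. keep the value when present, substitute 0 when it is null.
def defaultIfNull(num_cta: java.lang.Long): Long =
  Option(num_cta).map(_.longValue).getOrElse(0L)

val kept     = defaultIfNull(42L)   // value present: kept as-is
val replaced = defaultIfNull(null)  // null: replaced by the default
```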
If your validation rules are complex, then you can use udf functions instead:
def validateNC = udf((num_cta: Long) => {
  // define your rules here
})
You can call the udf function using withColumn:
dfNextRows.withColumn("num_cta", validateNC(col("num_cta")))
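The rule inside the udf is just an ordinary Scala function, so it can be written and tested on its own before being wrapped with udf(...). A sketch with a made-up rule, since the real rule for num_cta is not shown:

```scala
// Hypothetical rule: a num_cta value is valid when it is non-negative.
// In Spark this would then be wrapped as: val validateNC = udf(isValidNC _)
def isValidNC(num_cta: Long): Boolean = num_cta >= 0L

val good = isValidNC(12345L)
val bad  = isValidNC(-1L)
```

Keeping the rule as a named function makes it unit-testable independently of Spark.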
You can do the same for the rest of your validation rules.
I hope your problem gets resolved soon.
map is a transformation; you need to apply an action. For instance, you could do dfNextRows.map(x => ValidateRow(x)).first. Spark operates lazily, much like the Stream class in the standard collections.
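That laziness is easy to observe with LazyList (the successor of Stream since Scala 2.13): map only records the work to be done, and nothing actually runs until an element is demanded, which is exactly why an action such as first is needed in Spark. A small sketch:

```scala
var evaluations = 0  // counts how many times the mapped function runs

// map alone evaluates nothing; it only builds a description of the work
val mapped = LazyList(1, 2, 3).map { x => evaluations += 1; x * 2 }

// Forcing the first element (analogous to calling .first on a DataFrame)
// evaluates exactly one element of the sequence.
val first = mapped.head
```

This is why the println inside ValidateRow appears to print nothing: until an action forces the computation, the mapped function never executes.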