
Difference between columns of ArrayType in dataframe

I have a dataframe with 2 ArrayType columns, and I want to find the difference between the columns. column1 will always have values, while column2 may have an empty array. I created the following UDF, but it is not working.

df.show() gives the following records.

Sample data:

["Test", "Test1","Test3", "Test2"], ["Test", "Test1"]

Code:

sc.udf.register("diff", (value: Column,value1: Column)=>{ 
                        value.asInstanceOf[Seq[String]].diff(value1.asInstanceOf[Seq[String]])          
                    })  

Expected output:

["Test2","Test3"]

Spark version: 1.4.1. Any help will be appreciated.

column1 will always have values while column2 may have an empty array.

Your comment: "it gives all values of value" – undefined_variable

Example 1:

Let's look at a small example like this...

val A = Seq(1, 1)
A: Seq[Int] = List(1, 1)

val B = Seq.empty
B: Seq[Nothing] = List()

A diff B
res0: Seq[Int] = List(1, 1)

If you do a collection.SeqLike.diff here, you get A's value back, as shown in the example. Per Scala, this is a perfectly valid result, since you said column1 always has values (a non-empty Seq) while column2 may be empty.

Also, the reverse case looks like this...

B diff A
res1: Seq[Nothing] = List()

If you use a Spark UDF to do the above, the same results will come back.
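
For instance, a minimal sketch of that (assuming a Spark 1.4 shell where sc and sqlContext are in scope; the frame demo and the columns a/b are made-up names for illustration):

import org.apache.spark.sql.functions.udf
import sqlContext.implicits._

// hypothetical two-column frame; the second column holds an empty array
val demo = sc.parallelize(Seq((Seq(1, 1), Seq.empty[Int]))).toDF("a", "b")

val seqDiff = udf { (x: Seq[Int], y: Seq[Int]) => x diff y }

// a diff b keeps List(1, 1); b diff a yields List() -- same as plain Scala
demo.select(seqDiff($"a", $"b"), seqDiff($"b", $"a")).show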

EDIT: (the case where one array is not empty, as in your modified example)

Example 2:

 val p = Seq("Test", "Test1","Test3", "Test2")

 p: Seq[String] = List(Test, Test1, Test3, Test2)

 val q = Seq("Test", "Test1")

 q: Seq[String] = List(Test, Test1)

 p diff q

 res2: Seq[String] = List(Test3, Test2)

This matches the expected output given in your example.

Reverse case: I think this is what you are actually getting, which is not what you expect.

q diff p
res3: Seq[String] = List()

You need to change your udf to:

import org.apache.spark.sql.functions.udf

val diff_udf = udf { (a: Seq[String], b: Seq[String]) => a diff b }

Then this works:

import org.apache.spark.sql.functions.col
df.withColumn("diff",
  diff_udf(col("col1"), col("col2"))).show
+--------------------+-----------------+------------------+
|                col1|             col2|              diff|
+--------------------+-----------------+------------------+
|List(Test, Test1,...|List(Test, Test1)|List(Test3, Test2)|
+--------------------+-----------------+------------------+
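
If you want to keep the register-style call from your question: in Spark 1.4, UDF registration lives on sqlContext (not sc), and the parameters are plain Scala types, not Column. A sketch, assuming the temp table name t is free to use:

// register on sqlContext; arguments are Seq[String], not Column
sqlContext.udf.register("diff",
  (a: Seq[String], b: Seq[String]) => a diff b)

df.registerTempTable("t")
sqlContext.sql("SELECT col1, col2, diff(col1, col2) AS diff FROM t").show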

Data:

// assumes a Spark 1.4 shell where sqlContext implicits are in scope for toDF
import sqlContext.implicits._

val df = sc.parallelize(Seq((List("Test", "Test1", "Test3", "Test2"),
                             List("Test", "Test1")))).toDF("col1", "col2")
