[英]Split text and find the common words in a Spark Dataframe
我正在使用Spark处理Scala,我有一个包含两列文本的数据框。
这些列的格式为“ term1,term2,term3,...”,我想用它们两者的通用术语创建第三列。
例如
Col1
orange, apple, melon
party, clouds, beach
Col2
apple, apricot, watermelon
black, yellow, white
结果将是
Col3
1
0
到目前为止,我所做的是创建一个udf,该udf将文本拆分并获得两列的交集。
val common_terms = udf((a: String, b: String) => if (a.isEmpty || b.isEmpty) {
0
} else {
split(a, ",").intersect(split(b, ",")).length
})
然后在我的数据框上
val results = termsDF.withColumn("col3", common_terms(col("col1"), col("col2"))
但是我有以下错误
Error:(96, 13) type mismatch;
found : String
required: org.apache.spark.sql.Column
split(a, ",").intersect(split(b, ",")).length
因为我是Scala的新手,并且正尝试从在线教程中学习,所以我将不胜感激。
编辑:
val common_authors = udf((a: String, b: String) => if (a != null || b != null) {
0
} else {
val tempA = a.split( ",")
val tempB = b.split(",")
if ( tempA.isEmpty || tempB.isEmpty ) {
0
} else {
tempA.intersect(tempB).length
}
})
编辑后,如果我尝试termsDF.show()
它将运行。 但是,如果我执行类似termsDF.orderBy(desc("col3"))
则会得到java.lang.NullPointerException
尝试
val common_terms = udf((a: String, b: String) => if (a.isEmpty || b.isEmpty) {
0
} else {
var tmp1 = a.split(",")
var tmp2 = b.split(",")
tmp1.intersect(tmp2).length
})
val results = termsDF.withColumn("col3", common_terms($"a", $"b")).show
split(a,“,”)其spark列函数。 您正在使用udf,因此需要使用string.split()这是一个scala函数
编辑后:将null验证更改为==而不是!=
在Spark 2.4 sql中,如果没有UDF,则可以获得相同的结果。 看一下这个:
scala> val df = Seq(("orange,apple,melon","apple,apricot,watermelon"),("party,clouds,beach","black,yellow,white"), ("orange,apple,melon","apple,orange,watermelon")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala>
scala> df.createOrReplaceTempView("tasos")
scala> spark.sql(""" select col1,col2, filter(split(col1,','), x -> array_contains(split(col2,','),x) ) a1 from tasos """).show(false)
+------------------+------------------------+---------------+
|col1 |col2 |a1 |
+------------------+------------------------+---------------+
|orange,apple,melon|apple,apricot,watermelon|[apple] |
|party,clouds,beach|black,yellow,white |[] |
|orange,apple,melon|apple,orange,watermelon |[orange, apple]|
+------------------+------------------------+---------------+
如果您想要尺寸,那么
scala> spark.sql(""" select col1,col2, filter(split(col1,','), x -> array_contains(split(col2,','),x) ) a1 from tasos """).withColumn("a1_size",size('a1)).show(false)
+------------------+------------------------+---------------+-------+
|col1 |col2 |a1 |a1_size|
+------------------+------------------------+---------------+-------+
|orange,apple,melon|apple,apricot,watermelon|[apple] |1 |
|party,clouds,beach|black,yellow,white |[] |0 |
|orange,apple,melon|apple,orange,watermelon |[orange, apple]|2 |
+------------------+------------------------+---------------+-------+
scala>
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.