Filter a DataFrame based on words in an array in Apache Spark

Question

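I am trying to filter a Dataset so that it keeps only the rows that contain one of the words in an array. I am using the contains method, which works with a single string but not with an array. Here is the code: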

val dataSet = spark.read.option("header","true").option("inferschema","true").json(path).na.drop.cache()

val threats_path = spark.read.textFile("src/main/resources/cyber_threats").collect()

val newData = dataSet.select("*").filter(col("_source.raw_text").contains(threats_path)).show()

Since threats_path is an array of strings and contains expects a single string, this does not work. Any help would be appreciated.

Answer 1

You can use the isin method on the column.

It would look like:

val threats_path = spark.read.textFile("src/main/resources/cyber_threats").collect()

val dataSet = ???

dataSet.where(col("_source.raw_text").isin(threats_path: _*))

Note that if threats_path is large, this will hurt performance, both because of the collect and because of the filter using isin.
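One caveat worth spelling out: isin checks whether the column value is exactly equal to one of the listed values, so it does not test for a word occurring inside a longer text. If the goal is the substring match from the question, a possible sketch is to build one contains predicate per word and combine them with OR. The object name ContainsAnyExample, the helper containsAny, the flat column raw_text, and the sample rows are all assumptions made for illustration; with the question's data the column would be _source.raw_text and the words would come from threats_path.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit}

object ContainsAnyExample {
  // Builds a single predicate that is true when `column` contains
  // at least one of `words` as a substring.
  // An empty word list yields a predicate that matches no rows.
  def containsAny(column: Column, words: Seq[String]): Column =
    words.map(w => column.contains(w)).foldLeft(lit(false))(_ || _)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("contains-any")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the question's dataSet and threats_path.
    val dataSet: DataFrame = Seq(
      "a phishing email was reported",
      "routine backup completed",
      "ransomware found on host"
    ).toDF("raw_text")
    val threats = Seq("phishing", "ransomware")

    // Keeps the first and third rows; the second contains neither word.
    val matched = dataSet.filter(containsAny(col("raw_text"), threats))
    matched.show(truncate = false)

    spark.stop()
  }
}
```

Like isin, this still requires the word list to be collected to the driver, and the predicate grows with the number of words, so it is only reasonable for a fairly small list.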

I would suggest using a join between dataSet and threats_path to do the filtering instead. It would look like:

val dataSet = spark.read.option("header","true").option("inferschema","true").json(path).na.drop

val threats_path = spark.read.textFile("src/main/resources/cyber_threats")

val newData = threats_path.join(dataSet, col("_source.raw_text") === col("<col in threats_path >"), "leftouter").show()
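As a hedged variation on the join above, the sketch below assumes the goal is only to keep the rows of dataSet whose text contains some threat word. Under that assumption it puts dataSet on the left and uses a left-semi join with a contains condition, and it marks the threat list with a broadcast hint on the further assumption that the list is small. The names ThreatJoinExample, filterByThreats, raw_text, and threat are hypothetical. spark.read.textFile produces a single string column named value, which would be renamed (for example with toDF("threat")) before calling the helper.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, col}

object ThreatJoinExample {
  // Keeps the rows of `data` whose `textCol` contains at least one value
  // found in `threats.threatCol`. A left-semi join returns each matching
  // left row once and adds no columns from the right side.
  // `textCol` and `threatCol` must be distinct names to avoid ambiguity.
  def filterByThreats(data: DataFrame, threats: DataFrame,
                      textCol: String, threatCol: String): DataFrame =
    data.join(broadcast(threats), col(textCol).contains(col(threatCol)), "leftsemi")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("threat-join")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-ins for the question's dataSet and threat file.
    val dataSet = Seq(
      "a phishing email was reported",
      "routine backup completed",
      "ransomware found on host"
    ).toDF("raw_text")
    val threats = Seq("phishing", "ransomware").toDF("threat")

    filterByThreats(dataSet, threats, "raw_text", "threat").show(truncate = false)

    spark.stop()
  }
}
```

Because the join condition is a substring test rather than an equality, Spark cannot use a hash join for it, so the cost still grows with the product of the two sizes. The advantage over the collected-array approach is that the word list stays a distributed Dataset and never has to fit in one driver-side expression.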

Hope this helps.

Solution 1
0 Accepted 2018-10-02 17:38:20
