Apache Spark SQL query and DataFrame as reference data
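I have two Spark DataFrames.
The cities DataFrame has the following column:
city
-----
London
Austin
The bigCities DataFrame has the following column:
name
------
London
Cairo
I need to transform the cities DataFrame by adding a Boolean column, bigCity.
The value of this column must be computed from the condition "cities.city IN bigCities.name".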
I can do this with a static bigCities collection as follows:
cities.createOrReplaceTempView("cities")
val resultDf = spark.sql("SELECT city, CASE WHEN city IN ('London', 'Cairo') THEN 'Y' ELSE 'N' END AS bigCity FROM cities")
However, I don't know how to replace the static collection ('London', 'Cairo') in the query with the bigCities DataFrame. I want to use bigCities as reference data in the query.
Please advise how to achieve this.
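// Left-outer join against bigCities: unmatched cities get a null name, which maps to "N".
import org.apache.spark.sql.functions.when
val df = cities.join(bigCities, $"name".equalTo($"city"), "leftouter").
withColumn("bigCity", when($"name".isNull, "N").otherwise("Y")).
drop("name")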
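Since the question asks for a SQL query, the same left-join idea can also be written in Spark SQL. The following is a minimal sketch, not a definitive implementation: it assumes a standalone Scala script with a local SparkSession (in spark-shell, `spark` and its implicits already exist, so the session lines can be skipped), and the `SELECT DISTINCT` subquery is an added safeguard so that repeated names in bigCities do not duplicate rows of cities.

```scala
import org.apache.spark.sql.SparkSession

// Assumed local session for a standalone run; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("bigCityJoin").getOrCreate()
import spark.implicits._

// Sample data taken from the question.
val cities = Seq("London", "Austin").toDF("city")
val bigCities = Seq("London", "Cairo").toDF("name")
cities.createOrReplaceTempView("cities")
bigCities.createOrReplaceTempView("bigCities")

// A city with no match in bigCities gets a null b.name and is flagged 'N'.
// DISTINCT keeps duplicated names from multiplying rows in the join.
val resultDf = spark.sql(
  """SELECT c.city,
    |       CASE WHEN b.name IS NULL THEN 'N' ELSE 'Y' END AS bigCity
    |FROM cities c
    |LEFT JOIN (SELECT DISTINCT name FROM bigCities) b
    |  ON c.city = b.name""".stripMargin)

resultDf.show(false)
```

A join keeps the reference data distributed, so this form does not require bigCities to fit in a single array.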
You can use collect_list() on the bigCities table. Take a look at this:
scala> val df_city = Seq(("London"),("Austin")).toDF("city")
df_city: org.apache.spark.sql.DataFrame = [city: string]
scala> val df_bigCities = Seq(("London"),("Cairo")).toDF("name")
df_bigCities: org.apache.spark.sql.DataFrame = [name: string]
scala> df_city.createOrReplaceTempView("cities")
scala> df_bigCities.createOrReplaceTempView("bigCities")
scala> spark.sql(" select city, case when array_contains((select collect_list(name) from bigcities),city) then 'Y' else 'N' end as bigCity from cities").show(false)
+------+-------+
|city  |bigCity|
+------+-------+
|London|Y      |
|Austin|N      |
+------+-------+
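If the dataset is large, you can use collect_set instead. It drops duplicate names, so the collected array is smaller and cheaper to search when bigCities contains repeated values.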
scala> spark.sql(" select city, case when array_contains((select collect_set(name) from bigcities),city) then 'Y' else 'N' end as bigCity from cities").show(false)
+------+-------+
|city  |bigCity|
+------+-------+
|London|Y      |
|Austin|N      |
+------+-------+
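The collect-based queries above can also be expressed with the DataFrame API by collecting the reference names to the driver and testing membership with `Column.isin`. This is only a sketch under stated assumptions: it assumes a standalone script with a local SparkSession, the names `bigCityNames` and `flagged` are made up for illustration, and it assumes bigCities is small enough to collect.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.when

// Assumed local session for a standalone run; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("bigCityIsin").getOrCreate()
import spark.implicits._

// Sample data taken from the question.
val cities = Seq("London", "Austin").toDF("city")
val bigCities = Seq("London", "Cairo").toDF("name")

// Bring the distinct reference names to the driver (hypothetical helper value).
// Only sensible when bigCities is small.
val bigCityNames: Seq[String] = bigCities.select("name").distinct().as[String].collect().toSeq

// Flag each city by membership in the collected names.
val flagged = cities.withColumn(
  "bigCity",
  when($"city".isin(bigCityNames: _*), "Y").otherwise("N"))

flagged.show(false)
```

Like collect_list and collect_set, this gathers every name in one place, so for large reference data a join is likely the safer choice.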