
Apache Spark SQL query and DataFrame as reference data

I have two Spark DataFrames:

The cities DataFrame, with the following column:

city
-----
London
Austin

The bigCities DataFrame, with the following column:

name
------
London
Cairo

I need to transform the cities DataFrame and add one more Boolean column to it: bigCity. The value of this column must be computed from the condition "cities.city IN bigCities.name".

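I can do this in the following way (with a static bigCities collection):

cities.createOrReplaceTempView("cities")

val resultDf = spark.sql("SELECT city, CASE WHEN city IN ('London', 'Cairo') THEN 'Y' ELSE 'N' END AS bigCity FROM cities")

But I don't know how to replace the static bigCities collection ('London', 'Cairo') in the query with the bigCities DataFrame. I want to use bigCities as reference data in the query.

Please advise how to achieve this.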

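import org.apache.spark.sql.functions.when  // the $"..." syntax also needs spark.implicits._

// Left outer join: name is null for every city that has no match in bigCities
val df = cities.join(bigCities, $"name".equalTo($"city"), "leftouter").
                withColumn("bigCity", when($"name".isNull, "N").otherwise("Y")).
                drop("name")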
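
One thing to watch with the join above: a left outer join returns one row per match, so a city would appear twice if bigCities listed its name twice. A minimal sketch of a variant that deduplicates the reference names first and marks them with a broadcast hint, on the assumption that bigCities is small enough to ship to every executor. The object name BigCityJoin, the helper flagBigCities, and the sample rows are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, when}

object BigCityJoin {
  // Hypothetical helper: adds a bigCity column ("Y"/"N") to cities.
  def flagBigCities(cities: DataFrame, bigCities: DataFrame): DataFrame = {
    // Deduplicate the reference names so the join cannot multiply rows,
    // and hint that this side is small (assumption: it fits in memory).
    val reference = broadcast(bigCities.select("name").distinct())

    cities
      .join(reference, cities("city") === reference("name"), "left_outer")
      .withColumn("bigCity", when(reference("name").isNotNull, "Y").otherwise("N"))
      .drop("name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("big-city-join")
      .getOrCreate()
    import spark.implicits._

    val cities = Seq("London", "Austin").toDF("city")
    // "London" is listed twice on purpose to exercise the deduplication.
    val bigCities = Seq("London", "Cairo", "London").toDF("name")

    flagBigCities(cities, bigCities).show(false)
    spark.stop()
  }
}
```

The broadcast hint is only a hint; without it Spark picks a join strategy on its own, and the result is the same.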

You can use collect_list() on the bigCities table. Take a look at this:

scala> val df_city = Seq(("London"),("Austin")).toDF("city")
df_city: org.apache.spark.sql.DataFrame = [city: string]

scala> val df_bigCities = Seq(("London"),("Cairo")).toDF("name")
df_bigCities: org.apache.spark.sql.DataFrame = [name: string]

scala> df_city.createOrReplaceTempView("cities")

scala> df_bigCities.createOrReplaceTempView("bigCities")

scala> spark.sql(" select city, case when array_contains((select collect_list(name) from bigcities),city) then 'Y' else 'N' end as bigCity from cities").show(false)
+------+-------+
|city  |bigCity|
+------+-------+
|London|Y      |
|Austin|N      |
+------+-------+


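If the dataset is large, you can use collect_set instead. It removes duplicate names, so the array that array_contains has to scan is smaller. Either way the whole list ends up in a single array, so this approach suits reference data of modest size.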

scala> spark.sql(" select city, case when array_contains((select collect_set(name) from bigcities),city) then 'Y' else 'N' end as bigCity from cities").show(false)
+------+-------+
|city  |bigCity|
+------+-------+
|London|Y      |
|Austin|N      |
+------+-------+
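
The same idea can be expressed through the DataFrame API by collecting the reference names to the driver and passing them to Column.isin. A minimal sketch, assuming the distinct names fit comfortably in driver memory; the object name BigCityIsin, the helper flagWithIsin, and the sample rows are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, when}

object BigCityIsin {
  // Hypothetical helper: adds a bigCity column ("Y"/"N") to cities.
  def flagWithIsin(cities: DataFrame, bigCities: DataFrame): DataFrame = {
    // Pull the distinct, non-null reference names to the driver.
    // Assumption: the list is small; collect() on a large table is unsafe.
    val names: Seq[String] = bigCities
      .select("name")
      .where(col("name").isNotNull)
      .distinct()
      .collect()
      .map(_.getString(0))
      .toSeq

    // isin takes varargs, so the collected names are expanded with ": _*".
    cities.withColumn("bigCity", when(col("city").isin(names: _*), "Y").otherwise("N"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("big-city-isin")
      .getOrCreate()
    import spark.implicits._

    val cities = Seq("London", "Austin").toDF("city")
    val bigCities = Seq("London", "Cairo").toDF("name")

    flagWithIsin(cities, bigCities).show(false)
    spark.stop()
  }
}
```

This runs two Spark jobs (one to collect the names, one for the final query), which is the trade-off for avoiding both the join and the scalar subquery.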


