PySpark: joining a column of one table with either of two columns of another table

Question

My problem is as follows:

Table 1
ID1 ID2
 1  2 
 3  4

Table 2
C1    VALUE
 1    London
 4    Texas

Table 3
C3    VALUE
 2    Paris
 3    Arizona

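Table 1 holds a primary ID (ID1) and a secondary ID (ID2). I need to create a final output that aggregates the values from Table 2 and Table 3 based on the ID mapping in Table 1.

In other words, if a value in Table 2 or Table 3 maps to either of the two IDs, it should be aggregated under a single ID.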

My final output should look like:

ID  Aggregated
1  [2, London, Paris] // since Paris is mapped to 2, which in turn is mapped to 1
3  [4, Texas, Arizona] // Texas is mapped to 4 which in turn is mapped to 3

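Any suggestions on how to achieve this in PySpark?

I'm not sure whether joining the tables would solve this.

I thought a pair RDD might help here, but I couldn't come up with a proper solution.

Thanks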

Answer 1

Here is a very simple approach:

spark.sql(
"""
  select 1 as id1,2 as id2 
  union
  select 3 as id1,4 as id2 
""").createOrReplaceTempView("table1")

spark.sql(
"""
  select 1 as c1, 'london' as city 
  union
  select 4 as c1, 'texas' as city 
""").createOrReplaceTempView("table2")

spark.sql(
"""
  select 2 as c1, 'paris' as city 
  union
  select 3 as c1, 'arizona' as city 
""").createOrReplaceTempView("table3")

spark.table("table1").show()
spark.table("table2").show()
spark.table("table3").show()

# for simplicity, union table2 and table3

spark.sql(""" select * from table2 union all select * from table3 """).createOrReplaceTempView("city_mappings")
spark.table("city_mappings").show()

# now join to the ids:

spark.sql("""
  select id1, id2, city from table1
  join city_mappings on c1 = id1 or c1 = id2
""").createOrReplaceTempView("id_to_city")

# and finally you can aggregate: 

spark.sql("""
select id1, id2, collect_list(city)
from id_to_city
group by id1, id2
""").createOrReplaceTempView("result")

spark.table("result").show()

# the result looks like this (the order inside each list is not guaranteed); you can reshape it to better suit your needs:
+---+---+------------------+
|id1|id2|collect_list(city)|
+---+---+------------------+
|  1|  2|   [london, paris]|
|  3|  4|  [texas, arizona]|
+---+---+------------------+

Solution 1
0 2018-11-20 16:41:28
