Validate data in 2 columns against master table one column spark.sql
I have two tables: a master table of ZIP codes, and a transaction table containing a current address and a permanent address. Both address columns carry a ZIP code, and I need to validate those two ZIP codes against the master table.
Master Table:
+--------+--------------+-----+
|zip_code|territory_name|state|
+--------+--------------+-----+
|   81A02|  TERR NAME 02|   NY|
|   81A04|  TERR NAME 04|   FL|
|   81A05|  TERR NAME 05|   NJ|
|   81A06|  TERR NAME 06|   CA|
|   81A07|  TERR NAME 06|   CA|
+--------+--------------+-----+
Transaction table:
+-----------+-----------+-----+
|Address1_zc|Address2_zc|state|
+-----------+-----------+-----+
|      81A02|      81A05|   NY|
|      81A04|      81A06|   FL|
|      81A05|      90005|   NJ|
|      81A06|      90006|   CA|
|      41A06|      81A06|   CA|
+-----------+-----------+-----+
The result set should contain only the rows where both Address1_zc and Address2_zc hold valid ZIP codes:
+-----------+-----------+-----+
|Address1_zc|Address2_zc|state|
+-----------+-----------+-----+
|      81A02|      81A05|   NY|
|      81A04|      81A06|   FL|
+-----------+-----------+-----+
For testing, the following dataframes are provided:
df1= sqlContext.createDataFrame([("81A01","TERR NAME 01","NJ"),("81A01","TERR NAME 01","CA"),("81A02","TERR NAME 02","NY"),("81A03","TERR NAME 03","NY"), ("81A03","TERR NAME 03","CA"), ("81A04","TERR NAME 04","FL"), ("81A05","TERR NAME 05","NJ"), ("81A06","TERR NAME 06","CA"), ("81A06","TERR NAME 06","CA")], ["zip_code","territory_name","state"])
df1.createOrReplaceTempView("df1_mast")
df2 = sqlContext.createDataFrame([("81A02","81A05"),("81A04","81A06"),("81A05","90005"),("81A06","90006"),("41A06","81A06")], ["Address1_zc","Address2_zc"])
df2.createOrReplaceTempView("df1_tran")
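Stated without Spark, the desired filter keeps a transaction row only when both of its zip codes appear in the master table. A minimal plain-Python sketch of that logic, using the sample data above:

```python
# Plain-Python illustration of the intended filter (not Spark code):
# a transaction row survives only if BOTH zip codes exist in the master set.
master = {"81A01", "81A02", "81A03", "81A04", "81A05", "81A06"}
transactions = [
    ("81A02", "81A05"),
    ("81A04", "81A06"),
    ("81A05", "90005"),
    ("81A06", "90006"),
    ("41A06", "81A06"),
]
valid = [(a, b) for a, b in transactions if a in master and b in master]
print(valid)  # [('81A02', '81A05'), ('81A04', '81A06')]
```

In Spark, the equivalent of "is in the master set" is a join against the master table on each address column, which is what the answers below do.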
I tried the following SQL but could not get the desired result:
select a.* df1_tran a join df1_mast b on a.zip_code = b.Address_zc1 or a.zip_code = b.Address_zc2 where a.zip_code is null
Please help me.
PySpark way:
df1 = sqlContext.createDataFrame([("81A01","TERR NAME 01","NJ"),("81A01","TERR NAME 01","CA"),("81A02","TERR NAME 02","NY"),("81A03","TERR NAME 03","NY"), ("81A03","TERR NAME 03","CA"), ("81A04","TERR NAME 04","FL"), ("81A05","TERR NAME 05","NJ"), ("81A06","TERR NAME 06","CA"), ("81A06","TERR NAME 06","CA")], ["zip_code","territory_name","state"])
df2 = sqlContext.createDataFrame([("81A02","81A05"),("81A04","81A06"),("81A05","90005"),("81A06","90006"),("41A06","81A06")], ["Address1_zc","Address2_zc"])
df3 = df2.join(df1, df2['Address1_zc'] == df1['zip_code'], 'inner')
df4 = df3.withColumnRenamed('state', 'state1').drop(*(df1.columns))
df5 = df4.join(df1, df2['Address2_zc'] == df1['zip_code'], 'inner')
df6 = df5.withColumnRenamed('state', 'state2').drop(*(df1.columns))
df6.show()
+-----------+-----------+------+------+
|Address1_zc|Address2_zc|state1|state2|
+-----------+-----------+------+------+
|      81A02|      81A05|    NY|    NJ|
|      81A04|      81A06|    FL|    CA|
+-----------+-----------+------+------+
SQL way (this assumes df1 and df2 have been registered as temp views under those names, e.g. df1.createOrReplaceTempView("df1")):
SELECT t.*,
a.state AS state1,
b.state AS state2
FROM df2 AS t
JOIN df1 AS a ON t.Address1_zc = a.zip_code
JOIN df1 AS b ON t.Address2_zc = b.zip_code
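Equivalently, using the temp view names registered in the question (df1_mast and df1_tran), the same filter can be written with IN subqueries instead of explicit joins; a hedged alternative sketch:

```sql
-- Keep a transaction row only when both zip codes exist in the master view.
SELECT t.*
FROM df1_tran AS t
WHERE t.Address1_zc IN (SELECT zip_code FROM df1_mast)
  AND t.Address2_zc IN (SELECT zip_code FROM df1_mast)
```

This variant returns only the transaction columns; use the join form above if you also want to pull state (or other master-table columns) into the result.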