PySpark: Get columns based on other records
I have a DataFrame that looks like this:
membershipAccountNbr cntryRetailChannelCustId
111590058 1010015900581000010101
214100897 1010041008972100010101
104100897 1010041008971000010101
and another that looks like this:
membershipAccountNbr parentMembershipNbr
111590058 111590058
214100897 104100897
My goal is to make it look like this:
membershipAccountNbr parentMembershipNbr parentCustId
111590058 111590058 1010015900581000010101
214100897 104100897 1010041008971000010101
I tried using joins, but they give ambiguity errors. I am new to PySpark, please help.
Assume df1 is:
+--------------------+------------------------+
|membershipAccountNbr|cntryRetailChannelCustId|
+--------------------+------------------------+
| 111590058| 10100159005810000...|
| 214100897| 10100410089721000...|
| 104100897| 10100410089710000...|
+--------------------+------------------------+
and df2 is:
+--------------------+-------------------+
|membershipAccountNbr|parentMembershipNbr|
+--------------------+-------------------+
| 111590058| 111590058|
| 214100897| 104100897|
+--------------------+-------------------+
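The customer ID you want belongs to the parent account, so df1 has to be matched against parentMembershipNbr, not against membershipAccountNbr. Renaming df1's columns before the join also removes the duplicated column name that typically triggers the ambiguity error. Then run:
from pyspark.sql.functions import col

# Rename df1's columns so its key lines up with df2's parent column.
parents = df1.select(
    col("membershipAccountNbr").alias("parentMembershipNbr"),
    col("cntryRetailChannelCustId").alias("parentCustId"),
)
df2.join(parents, on="parentMembershipNbr", how="left").select(
    "membershipAccountNbr", "parentMembershipNbr", "parentCustId"
).show()
The result looks like this (row order may vary):
+--------------------+-------------------+--------------------+
|membershipAccountNbr|parentMembershipNbr| parentCustId|
+--------------------+-------------------+--------------------+
| 111590058| 111590058|10100159005810000...|
| 214100897| 104100897|10100410089710000...|
+--------------------+-------------------+--------------------+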
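The truncated `show()` output makes it hard to see which customer ID ends up in each row, so here is a minimal plain-Python sketch of the lookup the question asks for. It is only an illustration and does not use Spark; the names `accounts`, `memberships` and `result` are made up for this example, and the values are copied from the question.

```python
# Plain-Python illustration only (no Spark); rows are copied from the question.
accounts = {  # membershipAccountNbr -> cntryRetailChannelCustId
    "111590058": "1010015900581000010101",
    "214100897": "1010041008972100010101",
    "104100897": "1010041008971000010101",
}
memberships = [  # (membershipAccountNbr, parentMembershipNbr)
    ("111590058", "111590058"),
    ("214100897", "104100897"),
]

# Look up the customer ID by the *parent* number, not by the member's own number.
# dict.get returns None for an unknown parent, much like the nulls of a left join.
result = [(member, parent, accounts.get(parent)) for member, parent in memberships]

for row in result:
    print(*row)
```

The second row carries the customer ID of account 104100897 (the parent), which matches the target table in the question.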