How to compare two dataframes and calculate the differences in PySpark?
I have two DataFrames, and I'm trying to write a function that compares them and returns the net change for each affected column.
DF1:
+---------------+------+------+-------+----------+
| City | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta | 10 | 1 | 100 | 400 |
+---------------+------+------+-------+----------+
| Chicago | 100 | 2 | 200 | 500 |
+---------------+------+------+-------+----------+
| Boston | 100 | 3 | 300 | 600 |
+---------------+------+------+-------+----------+
| San Francisco | 1000 | 4 | 400 | 700 |
+---------------+------+------+-------+----------+
DF2:
+---------------+------+------+-------+----------+
| City | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta | 10 | 1 | 150 | 400 |
+---------------+------+------+-------+----------+
| Chicago | 100 | 2 | 200 | 450 |
+---------------+------+------+-------+----------+
| Boston | 100 | 3 | 300 | 650 |
+---------------+------+------+-------+----------+
| San Francisco | 1200 | 4 | 400 | 750 |
+---------------+------+------+-------+----------+
I would like the result to look like this:
+---------------+------+------+-------+----------+
| City | Temp | Zone | Score | Activity |
+---------------+------+------+-------+----------+
| Atlanta | 0 | 0 | 50 | 0 |
+---------------+------+------+-------+----------+
| Chicago | 0 | 0 | 0 | -50 |
+---------------+------+------+-------+----------+
| Boston | 0 | 0 | 0 | 50 |
+---------------+------+------+-------+----------+
| San Francisco | 200 | 0 | 0 | 50 |
+---------------+------+------+-------+----------+
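I'm new to PySpark and would like to know how to do this in PySpark.
I tried df2.subtract(df1),
but it only returns the rows of df2 that are not in df1, which makes it hard to see the net change in each column.
Note: the city name is the unique identifier. Every row is distinct.
Thanks for your help!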