PySpark: duplicates after joining two DataFrames
I have two PySpark DataFrames.

df1

TransID  Date   custusername
1        11/01  1A
2        11/01  1A
3        11/02  1A
4        11/02  1A
5        11/03  1A

df2

custusername  Date   CustID
1A            11/01  xx1
1A            11/02  xx1
1A            11/03  xx2
Desired output after joining the two DataFrames and counting:

Date   CustID  Count
11/01  xx1     2
11/02  xx1     2
11/03  xx2     1
The actual output I get is:

Date   CustID  Count
11/01  xx1     2
11/01  xx2     2
11/02  xx1     2
11/02  xx2     2
11/03  xx1     1
11/03  xx2     1
Because the CustID was updated on 11/03, my counts are duplicated.
My code:
join = [df1.custusername == df2.custusername]
joined = df1.join(df2, join, "inner")
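The duplication can be reproduced without Spark: joining on custusername alone matches every df1 transaction against every df2 row for that user, regardless of date, so each date ends up paired with every CustID the user has ever had. A plain-Python sketch of that inner join (counting distinct TransIDs per group, which reproduces the unwanted output above):

```python
from collections import defaultdict

# Toy data from the question: (TransID, Date, custusername) and
# (custusername, Date, CustID).
df1 = [(1, "11/01", "1A"), (2, "11/01", "1A"),
       (3, "11/02", "1A"), (4, "11/02", "1A"),
       (5, "11/03", "1A")]
df2 = [("1A", "11/01", "xx1"), ("1A", "11/02", "xx1"), ("1A", "11/03", "xx2")]

# Inner join on custusername ONLY: every df1 row matches all three
# df2 rows, because the date is not part of the join condition.
groups = defaultdict(set)
for trans_id, date, user in df1:
    for user2, _, cust_id in df2:
        if user == user2:
            groups[(date, cust_id)].add(trans_id)

for (date, cust_id), ids in sorted(groups.items()):
    print(date, cust_id, len(ids))
# 11/01 xx1 2
# 11/01 xx2 2
# 11/02 xx1 2
# 11/02 xx2 2
# 11/03 xx1 1
# 11/03 xx2 1
```

Every date appears under both xx1 and xx2 because nothing in the join ties a transaction's date to the df2 row for that date.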
Given the two DataFrames:
df1 = spark.createDataFrame([
(1, "11/01", "1A"),
(2, "11/01", "1A"),
(3, "11/02", "1A"),
(4, "11/02", "1A"),
(5, "11/03", "1A"),
], schema=['TransId', 'Date', 'custusername'])
df1.show()
+-------+-----+------------+
|TransId| Date|custusername|
+-------+-----+------------+
| 1|11/01| 1A|
| 2|11/01| 1A|
| 3|11/02| 1A|
| 4|11/02| 1A|
| 5|11/03| 1A|
+-------+-----+------------+
df2 = spark.createDataFrame([
("1A", "11/01", "xx1"),
("1A", "11/02", "xx1"),
("1A", "11/03", "xx2"),
], schema=['custusername', 'Date', 'CustId'])
df2.show()
+------------+-----+------+
|custusername| Date|CustId|
+------------+-----+------+
| 1A|11/01| xx1|
| 1A|11/02| xx1|
| 1A|11/03| xx2|
+------------+-----+------+
First, I group the first DataFrame by Date and custusername and take the count:
df1_group = df1.groupBy('Date', 'custusername').count()
df1_group.show()
+-----+------------+-----+
| Date|custusername|count|
+-----+------------+-----+
|11/01| 1A| 2|
|11/03| 1A| 1|
|11/02| 1A| 2|
+-----+------------+-----+
Then simply left-join the result with df2:
df = df1_group.join(df2, on=['custusername', 'Date'], how='left')
df.show()
+------------+-----+-----+------+
|custusername| Date|count|CustId|
+------------+-----+-----+------+
| 1A|11/01| 2| xx1|
| 1A|11/03| 1| xx2|
| 1A|11/02| 2| xx1|
+------------+-----+-----+------+
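Equivalently, the fix can be read as making Date part of the join key: once the join matches on both custusername and Date, each transaction pairs with exactly one df2 row, and counting afterwards gives the desired result. A plain-Python sketch of that join-then-count order:

```python
from collections import Counter

# Same toy data as in the question.
df1 = [(1, "11/01", "1A"), (2, "11/01", "1A"),
       (3, "11/02", "1A"), (4, "11/02", "1A"),
       (5, "11/03", "1A")]
df2 = [("1A", "11/01", "xx1"), ("1A", "11/02", "xx1"), ("1A", "11/03", "xx2")]

# Join on BOTH custusername and Date, then count per (Date, CustID).
counts = Counter(
    (date, cust_id)
    for _, date, user in df1
    for user2, date2, cust_id in df2
    if user == user2 and date == date2
)

for (date, cust_id), n in sorted(counts.items()):
    print(date, cust_id, n)
# 11/01 xx1 2
# 11/02 xx1 2
# 11/03 xx2 1
```

In PySpark the same order would be `df1.join(df2, on=['custusername', 'Date'], how='inner').groupBy('Date', 'CustId').count()`; both it and the groupBy-first version above avoid the duplicates because Date is in the join key.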