
Pyspark duplications after joining two dataframes

I have two PySpark DataFrames.

df1

TransID  Date     custusername
1        11/01      1A
2        11/01      1A
3        11/02      1A
4        11/02      1A
5        11/03      1A

df2

custusername   Date    CustID
1A             11/01    xx1
1A             11/02    xx1
1A             11/03    xx2

Desired output after joining the two DataFrames and counting:

Date   CustID   Count
11/01   xx1      2
11/02   xx1      2
11/03   xx2      1

The actual output I get is:

11/01   xx1      2
11/01   xx2      2
11/02   xx1      2
11/02   xx2      2
11/03   xx1      1
11/03   xx2      1

Because the CustID was updated on 11/03, my counts are duplicated.

My code:

join = [df1.custusername == df2.custusername]
joined = df1.join(df2, join, "inner")
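The join condition above matches on custusername only, so every df1 row for "1A" pairs with all three df2 rows, including the one carrying the other CustID. The following is a minimal pure-Python sketch of that row multiplication, not PySpark itself. The counting step is not shown in the question, so the sketch assumes distinct TransID values are counted per df1 Date and CustID; under that assumption it reproduces the six-row output above.

```python
from collections import defaultdict

# Rows copied from the question.
# df1: (TransID, Date, custusername); df2: (custusername, Date, CustID)
df1_rows = [
    (1, "11/01", "1A"),
    (2, "11/01", "1A"),
    (3, "11/02", "1A"),
    (4, "11/02", "1A"),
    (5, "11/03", "1A"),
]
df2_rows = [
    ("1A", "11/01", "xx1"),
    ("1A", "11/02", "xx1"),
    ("1A", "11/03", "xx2"),
]

# Inner join on custusername only: each df1 row pairs with every df2 row for "1A".
joined = [
    (trans_id, date1, cust_id)
    for (trans_id, date1, user1) in df1_rows
    for (user2, _date2, cust_id) in df2_rows
    if user1 == user2
]

# Assumed counting step (not shown in the question):
# distinct TransID per (df1 Date, CustID).
groups = defaultdict(set)
for trans_id, date, cust_id in joined:
    groups[(date, cust_id)].add(trans_id)
counts = {key: len(ids) for key, ids in sorted(groups.items())}

print(len(joined))  # 5 df1 rows x 3 df2 rows
for (date, cust_id), n in counts.items():
    print(date, cust_id, n)
```

Since Date is not part of the join condition, each date ends up associated with both xx1 and xx2, which is where the extra rows come from.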

Given the two DataFrames:

df1 = spark.createDataFrame([
    (1, "11/01", "1A"),
    (2, "11/01", "1A"),
    (3, "11/02", "1A"),
    (4, "11/02", "1A"),
    (5, "11/03", "1A"),
], schema=['TransId', 'Date', 'custusername'])
df1.show()
+-------+-----+------------+
|TransId| Date|custusername|
+-------+-----+------------+
|      1|11/01|          1A|
|      2|11/01|          1A|
|      3|11/02|          1A|
|      4|11/02|          1A|
|      5|11/03|          1A|
+-------+-----+------------+
df2 = spark.createDataFrame([
    ("1A", "11/01", "xx1"),
    ("1A", "11/02", "xx1"),
    ("1A", "11/03", "xx2"),
], schema=['custusername', 'Date', 'CustId'])
df2.show()
+------------+-----+------+
|custusername| Date|CustId|
+------------+-----+------+
|          1A|11/01|   xx1|
|          1A|11/02|   xx1|
|          1A|11/03|   xx2|
+------------+-----+------+

I would first group the first DataFrame by Date and custusername and count the rows:

df1_group = df1.groupBy('Date', 'custusername').count()
df1_group.show()
+-----+------------+-----+
| Date|custusername|count|
+-----+------------+-----+
|11/01|          1A|    2|
|11/03|          1A|    1|
|11/02|          1A|    2|
+-----+------------+-----+

Then simply join the result with df2:

df = df1_group.join(df2, on=['custusername', 'Date'], how='left')
df.show()
+------------+-----+-----+------+
|custusername| Date|count|CustId|
+------------+-----+-----+------+
|          1A|11/01|    2|   xx1|
|          1A|11/03|    1|   xx2|
|          1A|11/02|    2|   xx1|
+------------+-----+-----+------+
