
PySpark Inner Join producing too many rows?

I am using PySpark in a Jupyter Notebook. I'm trying to do an inner join between two datasets: one has 2455 rows and the other almost 20 million. Why is the inner join producing so many rows? Surely it should have no more than 2455 rows? Can anyone advise me on this?

print(df.count(),len(df.columns))
19725379 90

print(df1.count(),len(df1.columns))
2455 37

df3 = df.join(df1, "ADDRESS1", "inner")
df3.dropDuplicates(subset=['ADDRESS1']).count()  # note: this result is never assigned or printed
print(df3.count(), len(df3.columns))
603050 126
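One way to see where the extra rows come from (a diagnostic sketch reusing the df and df1 names from the question) is to count how often each join key occurs on each side: a key that appears n times in df and m times in df1 contributes n × m rows to the inner join.

from pyspark.sql import functions as F

# Keys of the large DataFrame that occur more than once;
# each repeat multiplies the number of joined rows
df.groupBy("ADDRESS1").count().filter(F.col("count") > 1).show()

# The same check on the smaller DataFrame
df1.groupBy("ADDRESS1").count().filter(F.col("count") > 1).show()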

No, not necessarily. Take this example:

df1 =

+------+------+
| t1   | t2   |
+------+------+
|    1 | A    |
|    2 | B    |
+------+------+

df2 =

+------+------+
| t1   | t3   |
+------+------+
|    1 | A2   |
|    2 | B2   |
|    3 | C2   |
|    1 | D2   |
|    2 | E2   |
+------+------+

In your words, the inner join on the key "t1" must have no more than 2 rows, but it does not.

The inner join with respect to the first column will be:

+------+------+------+
| t1   | t2   | t3   |
+------+------+------+
|    1 | A    | A2   |
|    1 | A    | D2   |
|    2 | B    | B2   |
|    2 | B    | E2   |
+------+------+------+
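Here is a minimal, self-contained sketch reproducing the example above, assuming a local SparkSession is available (for instance in the same Jupyter Notebook):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(1, "A"), (2, "B")], ["t1", "t2"])
df2 = spark.createDataFrame(
    [(1, "A2"), (2, "B2"), (3, "C2"), (1, "D2"), (2, "E2")],
    ["t1", "t3"],
)

# Each of the two keys in df1 matches two rows in df2, so the inner
# join yields 2 * 2 = 4 rows; key 3 has no match and is dropped.
df1.join(df2, "t1", "inner").show()

If you want at most one output row per ADDRESS1, deduplicate on that key before joining, e.g. df.dropDuplicates(['ADDRESS1']).join(df1, "ADDRESS1", "inner"). Note that dropDuplicates returns a new DataFrame rather than modifying the one it is called on, which is why the dropDuplicates line in the question has no effect on df3.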


 