I am using PySpark in a Jupyter Notebook. I'm trying to do an inner join of 2 datasets: one has 2455 rows and the other has over 1 million. Why is the inner join producing so many rows? Surely it should have fewer than 2455 rows? Can anyone advise me on this?
print(df.count(),len(df.columns))
19725379 90
print(df1.count(),len(df1.columns))
2455 37
df3 = df.join(df1,"ADDRESS1", "inner")
df3.dropDuplicates(subset=['ADDRESS1']).count()  # result is not assigned, so df3 itself is unchanged
print(df3.count(),len(df3.columns))
603050 126
df3 = df.join(df1,"ADDRESS1", "inner")
print(df3.count(),len(df3.columns))
603050 126
No, not necessarily. Take this example:
df 1 =
+------+------+
| t1 | t2 |
+------+------+
| 1 | A |
| 2 | B |
+------+------+
df 2 =
+------+------+
| t1 | t3 |
+------+------+
| 1 | A2 |
| 2 | B2 |
| 3 | C2 |
| 1 | D2 |
| 2 | E2 |
+------+------+
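By your reasoning, the inner join on the key "t1" should have no more than 2 rows, but it does not. Each row of df 1 is paired with every row of df 2 that has the same key, so duplicate keys multiply the output.
The inner join on the first column will be: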
+------+------+------+
| t1 | t2 | t3 |
+------+------+------+
| 1 | A | A2 |
| 1 | A | D2 |
| 2 | B | B2 |
| 2 | B | E2 |
+------+------+------+
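To make the row-count arithmetic concrete, here is a minimal sketch in plain Python (not Spark) that reproduces the example above. The helper names `inner_join` and `expected_join_rows` are made up for illustration. The point is that the size of an inner join is the sum, over each key, of (matches on the left) x (matches on the right), so it is not bounded by the size of the smaller table.

```python
from collections import Counter

def inner_join(left, right):
    """Naive inner join of (key, value) tuples on the key."""
    return [(lk, lv, rv) for lk, lv in left for rk, rv in right if lk == rk]

def expected_join_rows(left, right):
    """Predict the join size from key multiplicities alone."""
    left_counts = Counter(k for k, _ in left)
    right_counts = Counter(k for k, _ in right)
    # Each key contributes (rows on the left) * (rows on the right).
    return sum(n * right_counts[k] for k, n in left_counts.items())

# The two example tables from above, as (t1, value) tuples.
df_1 = [(1, "A"), (2, "B")]
df_2 = [(1, "A2"), (2, "B2"), (3, "C2"), (1, "D2"), (2, "E2")]

joined = inner_join(df_1, df_2)
for row in joined:
    print(row)

print(len(joined))                     # 4
print(expected_join_rows(df_1, df_2))  # 4
```

Applied to the question, 603050 joined rows from a 2455-row table suggests that many ADDRESS1 values occur more than once in the larger DataFrame. One way to check, assuming the column names from the question, is to inspect `df.groupBy("ADDRESS1").count()` and look for counts above 1.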