Given a PySpark DataFrame, how can I get all unique combinations of the columns `col1` and `col2`?
I can get the unique values of a single column, but not the unique pairs of `col1` and `col2`:
df.select('col1').distinct().rdd.map(lambda r: r[0]).collect()
I tried this, but it returns single values rather than pairs:
df.select(['col1','col2']).distinct().rdd.map(lambda r: r[0]).collect()
Your `lambda r: r[0]` keeps only the first field of each distinct row, which is why you get single values instead of pairs. Walking through your example:
>>> df = spark.createDataFrame([(1,2),(1,3),(1,2),(2,3)],['col1','col2'])
>>> df.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
| 1| 3|
| 1| 2|
| 2| 3|
+----+----+
>>> df.select('col1','col2').distinct().rdd.map(lambda r:r[0]).collect() ## your mapping
[1, 2, 1]
>>> df.select('col1','col2').distinct().show()
+----+----+
|col1|col2|
+----+----+
| 1| 3|
| 2| 3|
| 1| 2|
+----+----+
>>> df.select('col1','col2').distinct().rdd.map(lambda r:(r[0],r[1])).collect()
[(1, 3), (2, 3), (1, 2)]
Alternatively, try the following (PySpark also supports bracket selection and `drop_duplicates`, an alias of `dropDuplicates`):
`df[['col1', 'col2']].drop_duplicates()`