I have a DataFrame with several columns (attributes), and I want to get another DataFrame that contains only the columns with more than 6 distinct values.
How can I do that?
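The snippet below does this. The sample dataset has three columns (col1, col2, col3). col3 has only one unique value, 3, while col1 and col2 each have 6 distinct values, so the final DataFrame contains only col1 and col2. Note that the filter uses >= 6 (at least 6 distinct values) so that the sample columns qualify; change it to > 6 if you need strictly more than 6.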
>>> df = spark.createDataFrame([(1, 2, 3), (10, 20, 3), (20, 40, 3), (40, 50, 3), (50, 60, 3), (60, 70, 3)], ['col1', 'col2', 'col3'])
>>> # Keep columns with at least 6 distinct values; count() avoids pulling the rows to the driver.
>>> columns = [column for column in df.columns if df.select(column).distinct().count() >= 6]
>>> df.select(columns).show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|  10|  20|
|  20|  40|
|  40|  50|
|  50|  60|
|  60|  70|
+----+----+
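The list comprehension above launches one Spark job per column. If the DataFrame is wide, a possible alternative is to compute all distinct counts in a single aggregation and then filter the column names in plain Python. The following is only a minimal sketch: the helper names distinct_counts and pick_columns are made up for illustration and are not part of PySpark. Also note that countDistinct ignores nulls, whereas distinct().count() treats null as one more distinct value, so the two approaches can differ by one on columns that contain nulls.

```python
def distinct_counts(df):
    """Return {column_name: distinct_value_count} using one aggregation.

    Hypothetical helper; `df` is assumed to be a pyspark.sql.DataFrame.
    """
    # Imported here so that pick_columns below stays usable without Spark.
    from pyspark.sql import functions as F

    # One countDistinct expression per column, all evaluated in a single job.
    exprs = [F.countDistinct(F.col(c)).alias(c) for c in df.columns]
    return df.agg(*exprs).first().asDict()


def pick_columns(counts, min_distinct, strict=False):
    """Return the column names whose distinct count passes the threshold.

    strict=False keeps counts >= min_distinct (as in the snippet above);
    strict=True keeps only counts > min_distinct ("more than").
    """
    if strict:
        return [name for name, n in counts.items() if n > min_distinct]
    return [name for name, n in counts.items() if n >= min_distinct]
```

With the sample data, distinct_counts(df) should give 6 for col1, 6 for col2 and 1 for col3, so df.select(pick_columns(distinct_counts(df), 6)).show() selects the same two columns as before, while strict=True would select none of them.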