How could I get columns that meet a condition from a dataframe in pyspark?

I have a dataframe with different columns (or attributes), and I want to get another dataframe that contains only those columns which have more than 6 distinct values.

How could I get it?

The snippet below accomplishes your requirement. The sample dataset has three columns (col1, col2, col3). col3 has only one distinct value (3), while col1 and col2 each have 6 distinct values, so the final dataframe keeps only col1 and col2.

# Assumes an active SparkSession named `spark`
df = spark.createDataFrame([(1,2,3),(10,20,3),(20,40,3),(40,50,3),(50,60,3),(60,70,3)],['col1','col2','col3'])
# Keep columns with at least 6 distinct values (use > 6 for strictly more than 6)
columns = [column for column in df.columns if df.select(column).distinct().count() >= 6]
df.select(columns).show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|  10|  20|
|  20|  40|
|  40|  50|
|  50|  60|
|  60|  70|
+----+----+
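Note that the list comprehension above runs one Spark job per column, which gets slow on wide dataframes. As a minimal alternative sketch (not part of the original answer), the distinct counts of all columns can be gathered in a single aggregation pass with countDistinct; the threshold of 6 matches the snippet above:

from pyspark.sql import functions as F

# One aggregation job computes the distinct count of every column at once
counts = df.agg(*[F.countDistinct(c).alias(c) for c in df.columns]).first()
columns = [c for c in df.columns if counts[c] >= 6]
df.select(columns).show()

For very large tables, F.approx_count_distinct can be swapped in for F.countDistinct to trade exactness for speed.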
