How could I get columns that meet a condition from a dataframe in pyspark?

I have a dataframe with different columns (or attributes), and I want to get another dataframe that contains only those columns which have more than 6 distinct values.

How could I get it?

The snippet below accomplishes your requirement. The sample dataset has three columns (col1, col2, col3). col3 has only one distinct value (3), while col1 and col2 each have 6 distinct values, so the final dataframe keeps only col1 and col2.

# Assumes an active SparkSession named `spark`
df = spark.createDataFrame([(1,2,3),(10,20,3),(20,40,3),(40,50,3),(50,60,3),(60,70,3)],['col1','col2','col3'])
# Keep columns with at least 6 distinct values (use > 6 for strictly more than 6)
columns = [column for column in df.columns if df.select(column).distinct().count() >= 6]
df.select(columns).show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|  10|  20|
|  20|  40|
|  40|  50|
|  50|  60|
|  60|  70|
+----+----+
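Note that the list comprehension above runs one Spark job per column, which gets slow on wide dataframes. As a minimal alternative sketch (not part of the original answer), the distinct counts of all columns can be gathered in a single aggregation pass with countDistinct; the threshold of 6 matches the snippet above:

from pyspark.sql import functions as F

# One aggregation job computes the distinct count of every column at once
counts = df.agg(*[F.countDistinct(c).alias(c) for c in df.columns]).first()
columns = [c for c in df.columns if counts[c] >= 6]
df.select(columns).show()

For very large tables, F.approx_count_distinct can be swapped in for F.countDistinct to trade exactness for speed.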
