I have a Spark dataframe and I want to get the statistics for one of its columns:
stats_df = df.describe(['mycol'])
stats_df.show()
+-------+------------------+
|summary| mycol|
+-------+------------------+
| count| 300|
| mean| 2243|
| stddev| 319.419860456123|
| min| 1400|
| max| 3100|
+-------+------------------+
How do I extract the values of min and max in mycol using the min and max values of the summary column? How do I do it by number index?
OK, let's consider the following example:
from pyspark.sql.functions import rand, randn
df = sqlContext.range(1, 1000).toDF('mycol')
df.describe().show()
# +-------+-----------------+
# |summary| mycol|
# +-------+-----------------+
# | count| 999|
# | mean| 500.0|
# | stddev|288.5307609250702|
# | min| 1|
# | max| 999|
# +-------+-----------------+
If you want to access the row for stddev, for example, you just need to convert the result of describe() into an RDD, collect it, and convert it into a dictionary as follows:
# On Spark 2.0+ a DataFrame has no .map, so go through .rdd (this also works on 1.x)
stats = dict(df.describe().rdd.map(lambda r: (r.summary, r.mycol)).collect())
print(stats['stddev'])
# 288.5307609250702
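The question also asks about access by number index, so here is a minimal sketch of both lookups on the collected rows. It assumes the rows come from stats_df.collect() and arrive in the order shown in the question's output (count, mean, stddev, min, max). Plain tuples stand in for the pyspark Row objects so the snippet runs without a cluster, and the helper name stats_by_name is made up for illustration. describe() returns its values as strings, hence the float() casts.

```python
# Plain tuples stand in for the Row objects that stats_df.collect() would
# return; a pyspark Row is a tuple subclass, so the same indexing applies.
# The values are copied from the describe() output in the question.
rows = [
    ("count", "300"),
    ("mean", "2243"),
    ("stddev", "319.419860456123"),
    ("min", "1400"),
    ("max", "3100"),
]


def stats_by_name(rows):
    """Map each summary label to its value, cast from string to float.

    Hypothetical helper, not part of pyspark.
    """
    return {label: float(value) for label, value in rows}


# Lookup by summary label.
stats = stats_by_name(rows)
col_min = stats["min"]
col_max = stats["max"]

# Lookup by number index: row 3 is min and row 4 is max, assuming the
# row order shown above; element 1 of each row is the mycol value.
min_by_index = float(rows[3][1])
max_by_index = float(rows[4][1])

print(col_min, col_max, min_by_index, max_by_index)
```

Looking values up by label is the safer of the two, since it does not depend on the row order staying the same.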
You can easily assign a variable from a select on that dataframe.
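# Compare the column, not two Python strings ('summary' == 'min' is just False); x is a string such as '1400'
x = stats_df.where(stats_df['summary'] == 'min').select('mycol').first()['mycol']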