
How to obtain row percentages of crosstab from a spark dataframe using python?

I used the following Python code:

df.stat.crosstab("age", "y").orderBy("age_y").show()

to create a crosstab from a spark dataframe as follows:

[image: crosstab output]

However, I cannot find code to compute the row percentages. For example, the row percentages for age 18 should be 5/12 = 41.7% for 'no' and 7/12 = 58.3% for 'yes', so the two percentages sum to 100%.

Could someone advise me on this? Many thanks in advance.

Simply add two columns with withColumn, applying your formula to calculate the percentages:

from pyspark.sql import functions as F

# Crosstab of age against y; the first column is named "age_y"
df1 = df.stat.crosstab("age", "y").orderBy("age_y")

# Row percentage = count / (no + yes) * 100, rounded to 2 decimals
result = df1.withColumn(
    "no_rp",
    F.round(F.col("no") / (F.col("no") + F.col("yes")) * 100, 2)
).withColumn(
    "yes_rp",
    F.round(F.col("yes") / (F.col("no") + F.col("yes")) * 100, 2)
)

result.show()

#+-----+---+---+-----+------+
#|age_y| no|yes|no_rp|yes_rp|
#+-----+---+---+-----+------+
#|   18|  5|  7|41.67| 58.33|
#|   19| 24| 11|68.57| 31.43|
#|   20| 35| 15| 70.0|  30.0|
#+-----+---+---+-----+------+

