
How can I obtain percentage frequencies in pyspark

I want to obtain percentage frequencies in pyspark. In python (pandas) I did it like this:

Companies = df['Company'].value_counts(normalize = True)

Getting the raw counts is very simple:

# Companies in descending order of complaint count
df.createOrReplaceTempView('Comp')
CompDF = spark.sql("SELECT Company, count(*) as cnt \
                    FROM Comp \
                    GROUP BY Company \
                    ORDER BY cnt DESC")
CompDF.show()
+--------------------+----+  
|             Company| cnt|  
+--------------------+----+  
|BANK OF AMERICA, ...|1387|  
|       EQUIFAX, INC.|1285|  
|WELLS FARGO & COM...|1119|  
|Experian Informat...|1115|  
|TRANSUNION INTERM...|1001|  
|JPMORGAN CHASE & CO.| 905|  
|      CITIBANK, N.A.| 772|  
|OCWEN LOAN SERVIC...| 481|  

How do I get percentage frequencies from here? I've tried a bunch of things without much luck. Any help would be appreciated.

As Suresh hinted at in the comments, assuming total_count is the total number of rows in the original dataframe, you can use withColumn to add a new column named percentage to CompDF:

total_count = df.count()  # total number of rows (complaints) in the original dataframe

CompDF = CompDF.withColumn('percentage', CompDF.cnt / float(total_count))
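The same result can also be computed without the temp view at all. Below is a minimal sketch using only the DataFrame API, assuming df is the original Spark dataframe with a Company column (pct is a name chosen here for illustration):

from pyspark.sql import functions as F

total_count = df.count()  # total number of rows (complaints)

# count per company, then divide each count by the grand total
pct = (df.groupBy('Company')
         .count()
         .withColumn('percentage', F.col('count') / float(total_count))
         .orderBy(F.desc('count')))
pct.show()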

Modifying the SQL query should also get you the result you want:

    "SELECT Company,cnt/(SELECT SUM(cnt) from (SELECT Company, count(*) as cnt 
    FROM Comp GROUP BY Company ORDER BY cnt DESC) temp_tab) sum_freq from 
    (SELECT Company, count(*) as cnt FROM Comp GROUP BY Company ORDER BY cnt 
    DESC)"
