How can I obtain percentage frequencies in pyspark
I want to get percentage frequencies in pyspark. In Python I did it as follows:
Companies = df['Company'].value_counts(normalize = True)
Getting the raw frequencies is easy enough:
# Companies in descending order of complaint frequency
df.createOrReplaceTempView('Comp')
CompDF = spark.sql("SELECT Company, count(*) as cnt \
FROM Comp \
GROUP BY Company \
ORDER BY cnt DESC")
CompDF.show()
+--------------------+----+
| Company| cnt|
+--------------------+----+
|BANK OF AMERICA, ...|1387|
| EQUIFAX, INC.|1285|
|WELLS FARGO & COM...|1119|
|Experian Informat...|1115|
|TRANSUNION INTERM...|1001|
|JPMORGAN CHASE & CO.| 905|
| CITIBANK, N.A.| 772|
|OCWEN LOAN SERVIC...| 481|
How do I get percentage frequencies from here? I have tried a bunch of things without much luck. Any help would be appreciated.
As Suresh hinted in the comments, assuming total_count is the number of rows in the original dataframe df, you can use withColumn to add a new column named percentage to CompDF:
total_count = df.count()  # total number of rows (complaints) in the original dataframe
CompDF = CompDF.withColumn('percentage', CompDF.cnt / float(total_count))
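For comparison, here is a minimal sketch of the same computation done entirely with the DataFrame API instead of SQL, assuming df is still the original complaints dataframe with a Company column (the freq_df name is just illustrative):

from pyspark.sql import functions as F

# Total number of rows (complaints) in the original dataframe
total_count = df.count()

# Count per company, then divide each count by the overall total
freq_df = (df.groupBy('Company')
             .count()
             .withColumn('percentage', F.col('count') / float(total_count))
             .orderBy(F.col('count').desc()))

freq_df.show()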
Alternatively, modifying the SQL query will also get you the result you want:
"SELECT Company,cnt/(SELECT SUM(cnt) from (SELECT Company, count(*) as cnt
FROM Comp GROUP BY Company ORDER BY cnt DESC) temp_tab) sum_freq from
(SELECT Company, count(*) as cnt FROM Comp GROUP BY Company ORDER BY cnt
DESC)"