Group by and count on all columns of a Spark DataFrame
I want to perform a group-by on each column of a DataFrame using Spark SQL. The DataFrame will have approximately 1000 columns.
I have tried iterating over all the columns in the DataFrame and performing groupBy on each column, but the program takes more than 1.5 hours to execute.
// Load the table from Cassandra.
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "exp", "keyspace" -> "testdata"))
  .load()

// For each column, group by that column and keep the first 10 counts.
val groupedData = df.columns.map(c => df.groupBy(c).count().take(10).toList)
println("Printing grouped data: " + groupedData.mkString("\n"))
If the DataFrame has, for example, columns Name and Amount, then the output should look like:
GroupBy on column Name:
Name Count
Jon 2
Ram 5
David 3
GroupBy on column Amount:
Amount Count
1000 4
2525 3
3000 3
I want the group-by result for each column.
The only way I can see to speed this up is to cache the DataFrame straight after reading it.
Unfortunately, each per-column computation is independent, and you have to do all of them; there is no workaround.
Something like this can speed it up a little, but not by much:
// Cache after loading so each subsequent groupBy scans in-memory data
// instead of re-reading the table from Cassandra.
val df = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "exp", "keyspace" -> "testdata"))
  .load()
  .cache()
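With the cached DataFrame, the per-column loop from the question then reuses the in-memory data on every pass. A minimal sketch of how the two pieces fit together, reusing the question's own groupBy-per-column approach (the printing format is illustrative, not part of the original answer):

// Each groupBy now runs against the cached data rather than Cassandra.
val groupedData = df.columns.map(c => df.groupBy(c).count().take(10).toList)

// Print the top counts per column, in the format shown in the question.
df.columns.zip(groupedData).foreach { case (col, rows) =>
  println(s"GroupBy on column $col:")
  rows.foreach(println)
}

Note that caching only removes the repeated Cassandra reads; the ~1000 independent groupBy jobs themselves still have to run one after another.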