
Group by and count on Spark Data frame all columns

I want to perform a group by on each column of the data frame using Spark SQL. The Dataframe will have approximately 1000 columns.

I have tried iterating over all the columns in the data frame and performing groupBy on each column, but the program runs for more than 1.5 hours.

val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map( "table" -> "exp", "keyspace" -> "testdata"))
      .load()


// groupBy/count on every column; take(10) collects the first 10 groups per column
val groupedData = df.columns.map(c => df.groupBy(c).count().take(10).toList)
println("Printing Dataset: " + groupedData.mkString("\n"))

If I have columns in the Dataframe, for example Name and Amount, then the output should be like:

GroupBy on column Name:

Name    Count
Jon     2
Ram     5
David   3

GroupBy on column Amount:

Amount  Count
1000    4
2525    3
3000    3

I want the group-by result for each column.
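
For a single column this is just the standard groupBy/count; a minimal sketch, assuming the DataFrame is called df and has a Name column:

// Count rows per distinct value of one column (row order of show() is not guaranteed)
df.groupBy("Name").count().show()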

The only way I can see to speed this up here is to cache the df straight after reading it.

Unfortunately, each computation is independent and you have to do all of them; there is no workaround.

Something like this can speed things up a little, but not that much:

val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map( "table" -> "exp", "keyspace" -> "testdata"))
      .load()
      .cache()
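
With the cached df, the per-column loop from the question stays the same; a minimal sketch of the whole thing, where the final println loop is only illustrative:

// One groupBy/count job per column; every job reuses the cached data
// instead of re-reading the table from Cassandra
val groupedData = df.columns.map(c => df.groupBy(c).count().take(10).toList)

// Print up to 10 group counts for each column
df.columns.zip(groupedData).foreach { case (col, rows) =>
  println(s"GroupBy on column $col: " + rows.mkString(", "))
}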
