
Group by and count on Spark Data frame all columns

I want to perform a group by on each column of the data frame using Spark SQL. The Dataframe will have approximately 1000 columns.

I have tried iterating over all the columns in the data frame and performing groupBy on each column, but the program runs for more than 1.5 hours.

val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map( "table" -> "exp", "keyspace" -> "testdata"))
      .load()


// groupBy/count on every column; take(10) collects the first 10 groups per column
val groupedData = df.columns.map(c => df.groupBy(c).count().take(10).toList)
println("Printing Dataset: " + groupedData.mkString("\n"))

If I have columns in the Dataframe, for example Name and Amount, then the output should be like:

GroupBy on column Name:

Name    Count
Jon     2
Ram     5
David   3

GroupBy on column Amount:

Amount  Count
1000    4
2525    3
3000    3

I want the group-by result for each column.
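
For a single column this is just the standard groupBy/count; a minimal sketch, assuming the DataFrame is called df and has a Name column:

// Count rows per distinct value of one column (row order of show() is not guaranteed)
df.groupBy("Name").count().show()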

The only way I can see to speed this up here is to cache the df straight after reading it.

Unfortunately, each computation is independent and you have to do all of them; there is no workaround.

Something like this can speed things up a little, but not that much:

val df = sqlContext
      .read
      .format("org.apache.spark.sql.cassandra")
      .options(Map( "table" -> "exp", "keyspace" -> "testdata"))
      .load()
      .cache()
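
With the cached df, the per-column loop from the question stays the same; a minimal sketch of the whole thing, where the final println loop is only illustrative:

// One groupBy/count job per column; every job reuses the cached data
// instead of re-reading the table from Cassandra
val groupedData = df.columns.map(c => df.groupBy(c).count().take(10).toList)

// Print up to 10 group counts for each column
df.columns.zip(groupedData).foreach { case (col, rows) =>
  println(s"GroupBy on column $col: " + rows.mkString(", "))
}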
