How to get the distinct values and counts of each column in a dataframe and store them in another dataframe as (k,v) pairs using Spark 2 and Scala
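I want to get the distinct values of every column of a dataframe, along with their respective counts, and store them as (k,v) pairs in another dataframe. Note: my columns are not static; they keep changing. So I cannot hard-code the column names and have to loop over them instead.
For example, below is my dataframe: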
+----------------+-----------+------------+
|name            |country    |DOB         |
+----------------+-----------+------------+
| Blaze          |        IND|    19950312|
| Scarlet        |        USA|    19950313|
| Jonas          |        CAD|    19950312|
| Blaze          |        USA|    19950312|
| Jonas          |        CAD|    19950312|
| mark           |        USA|    19950313|
| mark           |        CAD|    19950313|
| Smith          |        USA|    19950313|
| mark           |        UK |    19950313|
| scarlet        |        CAD|    19950313|
+----------------+-----------+------------+
My final result should be created in a new dataframe as (k,v) pairs, where k is the distinct value and v is its count.
+----------------+-----------+------------+
|name            |country    |DOB         |
+----------------+-----------+------------+
| (Blaze,2)      | (IND,1)   |(19950312,3)|
| (Scarlet,2)    | (USA,4)   |(19950313,6)|
| (Jonas,3)      | (CAD,4)   |            |
| (mark,3)       | (UK,1)    |            |
| (smith,1)      |           |            |
+----------------+-----------+------------+
Can anyone help me with this? I am using Spark 2.4.0 and Scala 2.11.12.
Note: my columns are dynamic, so I cannot hard-code the columns and group by them.
I don't have an exact solution to your query, but I can certainly offer some help to get you started on your problem.
Create the dataframe:
scala> val df = Seq(("Blaze ","IND","19950312"),
| ("Scarlet","USA","19950313"),
| ("Jonas ","CAD","19950312"),
| ("Blaze ","USA","19950312"),
| ("Jonas ","CAD","19950312"),
| ("mark ","USA","19950313"),
| ("mark ","CAD","19950313"),
| ("Smith ","USA","19950313"),
| ("mark ","UK ","19950313"),
| ("scarlet","CAD","19950313")).toDF("name", "country","dob")
Next, compute the count of each distinct element for every column:
scala> val distCount = df.columns.map(c => df.groupBy(c).count)
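As a variation on the per-column groupBy above, the counts can also be kept in a single long-format dataframe with one row per (column, value, count), which avoids collecting anything to the driver. This is only a sketch, not the layout the question asks for: the names `longCounts`, `column`, and `value` are illustrative, the local SparkSession setup is an assumption for a standalone run, and the sample rows repeat the data above without the padding spaces.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

// Assumption: a local session for a standalone run; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("distinct-counts-long").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Blaze", "IND", "19950312"),
  ("Scarlet", "USA", "19950313"),
  ("Jonas", "CAD", "19950312"),
  ("Blaze", "USA", "19950312"),
  ("Jonas", "CAD", "19950312"),
  ("mark", "USA", "19950313"),
  ("mark", "CAD", "19950313"),
  ("Smith", "USA", "19950313"),
  ("mark", "UK", "19950313"),
  ("scarlet", "CAD", "19950313")
).toDF("name", "country", "dob")

// Build one (column, value, count) dataframe per column and union them.
// Column names come from df.columns, so nothing is hard-coded.
// Casting to string lets columns of different types share the "value" column.
val longCounts = df.columns
  .map { c =>
    df.groupBy(col(c).cast("string").as("value"))
      .count()
      .select(lit(c).as("column"), col("value"), col("count"))
  }
  .reduce(_ union _)

longCounts.show(false)
```

Because everything stays in Spark, this form should also cope with columns that have many distinct values, at the cost of a different shape from the one requested.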
Create a range to iterate over distCount:
scala> val range = Range(0,distCount.size)
range: scala.collection.immutable.Range = Range(0, 1, 2)
Aggregate your data:
scala> val aggVal = range.toList.map(i => distCount(i).collect().mkString).toSeq
aggVal: scala.collection.immutable.Seq[String] = List([Jonas ,2][Smith ,1][Scarlet,1][scarlet,1][mark ,3][Blaze ,2], [CAD,4][USA,4][IND,1][UK ,1], [19950313,6][19950312,4])
Create the dataframe:
scala> Seq((aggVal(0),aggVal(1),aggVal(2))).toDF("name", "country","dob").show()
+--------------------+--------------------+--------------------+
| name| country| dob|
+--------------------+--------------------+--------------------+
|[Jonas ,2][Smith...|[CAD,4][USA,4][IN...|[19950313,6][1995...|
+--------------------+--------------------+--------------------+
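The result above packs every pair for a column into one string in a single row. To get closer to the layout in the question (one pair per row, with blanks where a column runs out of values), one possible sketch follows. It assumes the number of distinct values per column is small enough to collect to the driver and that no input column is itself named `count`. The names `perColumn`, `height`, `rows`, and `result` are illustrative, the local SparkSession setup is an assumption, and the sample rows repeat the data above without the padding spaces.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// Assumption: a local session for a standalone run; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("distinct-counts-wide").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Blaze", "IND", "19950312"),
  ("Scarlet", "USA", "19950313"),
  ("Jonas", "CAD", "19950312"),
  ("Blaze", "USA", "19950312"),
  ("Jonas", "CAD", "19950312"),
  ("mark", "USA", "19950313"),
  ("mark", "CAD", "19950313"),
  ("Smith", "USA", "19950313"),
  ("mark", "UK", "19950313"),
  ("scarlet", "CAD", "19950313")
).toDF("name", "country", "dob")

// For each column, collect its "(value,count)" strings to the driver,
// highest count first, ties broken by value so the order is stable.
val perColumn: Seq[Seq[String]] = df.columns.toSeq.map { c =>
  df.groupBy(c).count()
    .orderBy(col("count").desc, col(c))
    .collect().toSeq
    .map(r => s"(${r.get(0)},${r.getLong(1)})")
}

// The longest column decides the number of output rows;
// shorter columns are padded with null.
val height = perColumn.map(_.size).max
val rows = (0 until height).map { i =>
  Row.fromSeq(perColumn.map(vals => if (i < vals.size) vals(i) else null))
}

// The schema is derived from df.columns, so no column name is hard-coded.
val schema = StructType(df.columns.map(c => StructField(c, StringType, nullable = true)))
val result = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

result.show(false)
```

The pairs are stored as plain strings here to keep the schema simple; a struct of value and count per cell would be an alternative if the counts need to stay numeric.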
I hope this helps.