How to aggregate the values of 2 columns in a Spark DataFrame
I have a DataFrame with 4 columns.
+---------------+----------------------+---------------+-------------+
| district|sum(aadhaar_generated)| district|sum(rejected)|
+---------------+----------------------+---------------+-------------+
| Namsai| 5| Namsai| 0|
| Champawat| 1584| Champawat| 131|
| Nagaur| 12601| Nagaur| 697|
| Umaria| 2485| Umaria| 106|
| Rajnandgaon| 785| Rajnandgaon| 57|
| Chikkamagaluru| 138| Chikkamagaluru| 26|
|Tiruchirappalli| 542|Tiruchirappalli| 527|
| Baleswar| 2963| Baleswar| 1703|
| Pilibhit| 1858| Pilibhit| 305|
+---------------+----------------------+---------------+-------------+
I need to add, row by row, the values of sum(aadhaar_generated) and sum(rejected).
For example, for the second row my output should be:
+---------------+------------+
| district| total sum |
+---------------+------------+
| Champawat| 1715 |
+---------------+------------+
i.e. 1584 + 131 = 1715
How can I achieve this in Scala?
Could you please try the snippet below?
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.StructField
val spark = SparkSession
.builder()
.config("spark.master", "local[1]")
.appName("Test Job")
.getOrCreate()
import spark.implicits._
val sparkContext = spark.sparkContext
sparkContext.setLogLevel("WARN")
//DEFINING INPUT SCHEMA
val inputSchema = StructType(Array(
  StructField("district", StringType, false),
  StructField("sum(aadhaar_generated)", DoubleType, false),
  StructField("district_name", StringType, false),
  StructField("sum(rejected)", DoubleType, false)))
//READING INPUT FILE
val dF = spark.read.format("csv")
  .option("sep", ",")
  .option("header", true)
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .schema(inputSchema)
  .load("path\\to\\file")
println("Input DF")
dF.show()
val aggDF = dF.withColumn("Sum_Value", $"sum(aadhaar_generated)" + $"sum(rejected)")
println("After Aggregation")
aggDF.show()
OUTPUT
Input DF
+---------------+----------------------+---------------+-------------+
| district|sum(aadhaar_generated)| district_name|sum(rejected)|
+---------------+----------------------+---------------+-------------+
| Namsai| 5.0| Namsai| 0.0|
| Champawat| 1584.0| Champawat| 131.0|
| Nagaur| 12601.0| Nagaur| 697.0|
| Umaria| 2485.0| Umaria| 106.0|
| Rajnandgaon| 785.0| Rajnandgaon| 57.0|
| Chikkamagaluru| 138.0| Chikkamagaluru| 26.0|
|Tiruchirappalli| 542.0|Tiruchirappalli| 527.0|
| Baleswar| 2963.0| Baleswar| 1703.0|
| Pilibhit| 1858.0| Pilibhit| 305.0|
+---------------+----------------------+---------------+-------------+
After Aggregation
+---------------+----------------------+---------------+-------------+---------+
| district|sum(aadhaar_generated)| district_name|sum(rejected)|Sum_Value|
+---------------+----------------------+---------------+-------------+---------+
| Namsai| 5.0| Namsai| 0.0| 5.0|
| Champawat| 1584.0| Champawat| 131.0| 1715.0|
| Nagaur| 12601.0| Nagaur| 697.0| 13298.0|
| Umaria| 2485.0| Umaria| 106.0| 2591.0|
| Rajnandgaon| 785.0| Rajnandgaon| 57.0| 842.0|
| Chikkamagaluru| 138.0| Chikkamagaluru| 26.0| 164.0|
|Tiruchirappalli| 542.0|Tiruchirappalli| 527.0| 1069.0|
| Baleswar| 2963.0| Baleswar| 1703.0| 4666.0|
| Pilibhit| 1858.0| Pilibhit| 305.0| 2163.0|
+---------------+----------------------+---------------+-------------+---------+
Please let me know if that works.
EDIT
The following answer assumes that the district value in both columns of each row is the same.
You can do that using the withColumn method of Spark DataFrames:
# create some data
>>> data = [['a', 1, 2], ['a', 2, 2], ['b', 4, 3]]
>>> df = spark.createDataFrame(data, ['district', 'aadhar_generated', 'rejected'])
>>> df.show()
+--------+----------------+--------+
|district|aadhar_generated|rejected|
+--------+----------------+--------+
| a| 1| 2|
| a| 2| 2|
| b| 4| 3|
+--------+----------------+--------+
# create the output column
>>> import pyspark.sql.functions as F
>>> df = df.withColumn("new total", F.col('aadhar_generated')+F.col('rejected'))
>>> df.show()
+--------+----------------+--------+---------+
|district|aadhar_generated|rejected|new total|
+--------+----------------+--------+---------+
| a| 1| 2| 3|
| a| 2| 2| 4|
| b| 4| 3| 7|
+--------+----------------+--------+---------+