
How to aggregate the values of 2 columns in a spark dataframe

I have a DataFrame with 4 columns.

+---------------+----------------------+---------------+-------------+          
|       district|sum(aadhaar_generated)|       district|sum(rejected)|
+---------------+----------------------+---------------+-------------+
|         Namsai|                     5|         Namsai|            0|
|      Champawat|                  1584|      Champawat|          131|
|         Nagaur|                 12601|         Nagaur|          697|
|         Umaria|                  2485|         Umaria|          106|
|    Rajnandgaon|                   785|    Rajnandgaon|           57|
| Chikkamagaluru|                   138| Chikkamagaluru|           26|
|Tiruchirappalli|                   542|Tiruchirappalli|          527|
|       Baleswar|                  2963|       Baleswar|         1703|
|       Pilibhit|                  1858|       Pilibhit|          305|
+---------------+----------------------+---------------+-------------+

I need to add the corresponding values of sum(aadhaar_generated) and sum(rejected) in each row.

For example, for the second row my output should be:

+---------------+------------+          
|       district|  total sum |                                                                   
+---------------+------------+
|      Champawat| 1715       |
+---------------+------------+

i.e. 1584 + 131 = 1715

How can I achieve the same in Scala?

Could you please try the snippet below?

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.DoubleType
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.StructField

val spark = SparkSession
  .builder()
  .config("spark.master", "local[1]")
  .appName("Test Job")
  .getOrCreate()

import spark.implicits._
val sparkContext = spark.sparkContext
sparkContext.setLogLevel("WARN")

// Define the schema of the input file
val inputSchema = StructType(Array(StructField("district", StringType, false),
  StructField("sum(aadhaar_generated)", DoubleType, false),
  StructField("district_name", StringType, false),
  StructField("sum(rejected)", DoubleType, false)))

// Read the input CSV file with the schema defined above
val dF = spark.read.format("csv").option("sep", ",")
  .option("header", true)
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .schema(inputSchema)
  .load("path\\to\\file")

println("Input DF")
dF.show()

// Add the two sum columns row-wise into a new column
val aggDF = dF.withColumn("Sum_Value", $"sum(aadhaar_generated)" + $"sum(rejected)")
println("After Aggregation")
aggDF.show()

OUTPUT

Input DF
+---------------+----------------------+---------------+-------------+
|       district|sum(aadhaar_generated)|  district_name|sum(rejected)|
+---------------+----------------------+---------------+-------------+
|         Namsai|                   5.0|         Namsai|          0.0|
|      Champawat|                1584.0|      Champawat|        131.0|
|         Nagaur|               12601.0|         Nagaur|        697.0|
|         Umaria|                2485.0|         Umaria|        106.0|
|    Rajnandgaon|                 785.0|    Rajnandgaon|         57.0|
| Chikkamagaluru|                 138.0| Chikkamagaluru|         26.0|
|Tiruchirappalli|                 542.0|Tiruchirappalli|        527.0|
|       Baleswar|                2963.0|       Baleswar|       1703.0|
|       Pilibhit|                1858.0|       Pilibhit|        305.0|
+---------------+----------------------+---------------+-------------+

After Aggregation
+---------------+----------------------+---------------+-------------+---------+
|       district|sum(aadhaar_generated)|  district_name|sum(rejected)|Sum_Value|
+---------------+----------------------+---------------+-------------+---------+
|         Namsai|                   5.0|         Namsai|          0.0|      5.0|
|      Champawat|                1584.0|      Champawat|        131.0|   1715.0|
|         Nagaur|               12601.0|         Nagaur|        697.0|  13298.0|
|         Umaria|                2485.0|         Umaria|        106.0|   2591.0|
|    Rajnandgaon|                 785.0|    Rajnandgaon|         57.0|    842.0|
| Chikkamagaluru|                 138.0| Chikkamagaluru|         26.0|    164.0|
|Tiruchirappalli|                 542.0|Tiruchirappalli|        527.0|   1069.0|
|       Baleswar|                2963.0|       Baleswar|       1703.0|   4666.0|
|       Pilibhit|                1858.0|       Pilibhit|        305.0|   2163.0|
+---------------+----------------------+---------------+-------------+---------+
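If you only need the district and the total (as in the expected output above), you could additionally select just those two columns; a minimal sketch, reusing aggDF from the snippet above and renaming the new column to match:

// Keep only the district and the row-wise total, renamed to match the expected output
val resultDF = aggDF.select($"district", $"Sum_Value".alias("total sum"))
resultDF.show()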

Please let me know if that works.

EDIT

The following answer assumes that the district value in both columns of each row is the same.
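If you want to verify that assumption on the original 4-column DataFrame first, a quick check could look like this (a sketch in Scala, reusing dF and the column names from the snippet above):

// Count rows where the two district columns disagree; 0 means the assumption holds
val mismatches = dF.filter($"district" =!= $"district_name").count()
println(s"Rows with mismatched district values: $mismatches")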


You can do that using the withColumn method of Spark DataFrames:

# create some data
>>> data = [['a', 1, 2], ['a', 2, 2], ['b', 4, 3]]
>>> df = spark.createDataFrame(data, ['district', 'aadhar_generated', 'rejected'])
>>> df.show()
+--------+----------------+--------+
|district|aadhar_generated|rejected|
+--------+----------------+--------+
|       a|               1|       2|
|       a|               2|       2|
|       b|               4|       3|
+--------+----------------+--------+

# create the output column
>>> import pyspark.sql.functions as F
>>> df = df.withColumn("new total", F.col('aadhar_generated')+F.col('rejected'))
>>> df.show()
+--------+----------------+--------+---------+
|district|aadhar_generated|rejected|new total|
+--------+----------------+--------+---------+
|       a|               1|       2|        3|
|       a|               2|       2|        4|
|       b|               4|       3|        7|
+--------+----------------+--------+---------+
