Join, aggregate, then select specific columns in Apache Spark

Question

Products and their respective sales are loaded correctly from CSV files like so:

    Dataset<Row> dfProducts = sparkSession.read()
            .option("mode", "DROPMALFORMED")
            .option("header", "true")
            .option("inferSchema", "true")
            .option("charset", "UTF-8")
            .csv(new ClassPathResource("products.csv").getURL().getPath());
    Dataset<Row> dfSaledetails = sparkSession.read()
            .option("mode", "DROPMALFORMED")
            .option("header", "true")
            .option("inferSchema", "true")
            .option("charset", "UTF-8")
            .csv(new ClassPathResource("saledetails.csv").getURL().getPath());

The products dataset has columns (product_id, product_name, ...). The sales dataset has columns (product_id, amount, ...).

What I need to achieve is to join the two datasets on a common column (product_id), group by product_id, sum the amount column, and then select/display only specific columns (product_name and the result of the summation).

The following is my attempt:

    Dataset<Row> dfSalesTotals = dfSaledetails
            .join(dfProducts, dfSaledetails.col("product_id").equalTo(dfProducts.col("product_id")))
            .groupBy(dfSaledetails.col("product_id"))
            .agg(sum(dfSaledetails.col("amount")).alias("total_amount"))
            .select(dfProducts.col("product_name"), col("total_amount"));
    dfSalesTotals.show();

This throws the following error:

    Caused by: org.apache.spark.sql.AnalysisException: Resolved attribute(s) product_name#215 missing from product_id#272,total_amount#499 in operator !Project [product_name#215, total_amount#499].;;
    !Project [product_name#215, total_amount#499]
    +- Aggregate [product_id#272], [product_id#272, sum(amount#277) AS total_amount#499]
       +- Join Inner, (product_id#272 = product_id#212)
          :- Relation[sale_detail_auto_id#266,sale_auto_id#267,sale_id#268,agent_id#269,sale_detail_id#270,inventory_id#271,product_id#272,unit_cost#273,unit_price#274,vat#275,quantity#276,amount#277,promotion_id#278,discount#279] csv
          +- Relation[product_id#212,user_group_id_super_owner#213,product_category#214,product_name#215,product_type#216,product_code#217,distributor_code#218,product_units#219,product_unitCost#220,product_manufacturer#221,product_distributor#222,create_date#223,update_date#224,vat#225,product_weight#226,carton_size#227,product_listStatus#228,active_status#229,distributor_type#230,bundle_type#231,barcode_type#232,product_family_id#233] csv

Answer 1

Only the grouping columns and the aggregate results survive an aggregation, so if you want to keep product_name it must appear either in the groupBy:

.groupBy(
  dfSaledetails.col("product_id"),
  col("product_name"))

or in the agg:

.agg(
  sum(dfSaledetails.col("amount")).alias("total_amount"), 
  first(col("product_name")).alias("product_name"))
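
Putting the first option together with the query from the question, a minimal end-to-end sketch might look like the following. The class name `SalesTotalsExample`, the helper methods, the local master setting, and the in-memory rows ("Soap", "Tea") are hypothetical stand-ins for the CSV files; only the join, groupBy, agg and select calls mirror the code above. It assumes the Spark SQL library is on the classpath.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical example class; not part of the original question or answer.
final class SalesTotalsExample {

    // Joins sales to products and returns one row per product: (product_name, total_amount).
    static Dataset<Row> totalsByProduct(Dataset<Row> dfSaledetails, Dataset<Row> dfProducts) {
        return dfSaledetails
                .join(dfProducts, dfSaledetails.col("product_id").equalTo(dfProducts.col("product_id")))
                // product_name is a grouping column here, so it is still available after agg().
                .groupBy(dfSaledetails.col("product_id"), dfProducts.col("product_name"))
                .agg(sum(dfSaledetails.col("amount")).alias("total_amount"))
                .select(col("product_name"), col("total_amount"));
    }

    // Builds two tiny in-memory tables (made-up rows standing in for the CSV files)
    // and returns the totals sorted by product name.
    static List<Row> collectDemoTotals() {
        SparkSession spark = SparkSession.builder()
                .master("local[1]") // assumption: run locally for the demo
                .appName("sales-totals-example")
                .getOrCreate();

        StructType productSchema = new StructType()
                .add("product_id", DataTypes.IntegerType)
                .add("product_name", DataTypes.StringType);
        StructType saleSchema = new StructType()
                .add("product_id", DataTypes.IntegerType)
                .add("amount", DataTypes.DoubleType);

        Dataset<Row> dfProducts = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1, "Soap"),
                RowFactory.create(2, "Tea")), productSchema);
        Dataset<Row> dfSaledetails = spark.createDataFrame(Arrays.asList(
                RowFactory.create(1, 10.0),
                RowFactory.create(1, 5.0),
                RowFactory.create(2, 7.5)), saleSchema);

        return totalsByProduct(dfSaledetails, dfProducts)
                .orderBy(col("product_name"))
                .collectAsList();
    }

    public static void main(String[] args) {
        for (Row row : collectDemoTotals()) {
            System.out.println(row.getString(0) + " -> " + row.getDouble(1));
        }
    }
}
```

Grouping on both product_id and product_name gives the same groups as grouping on product_id alone as long as each product_id maps to exactly one product_name. If that is not guaranteed, the first(...) variant picks one name per product_id instead.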

Answer 1 (2 votes, accepted), 2018-12-19 14:15:25
