Join, Aggregate Then Select Specific Columns In Apache Spark
Products and their respective sales are loaded from CSV files correctly, like so:
Dataset<Row> dfProducts = sparkSession.read()
.option("mode", "DROPMALFORMED")
.option("header", "true")
.option("inferSchema", "true")
.option("charset", "UTF-8")
.csv(new ClassPathResource("products.csv").getURL().getPath());
Dataset<Row> dfSaledetails = sparkSession.read()
.option("mode", "DROPMALFORMED")
.option("header", "true")
.option("inferSchema", "true")
.option("charset", "UTF-8")
.csv(new ClassPathResource("saledetails.csv").getURL().getPath());
Products has columns (product_id, product_name, ...). Sales has columns (product_id, amount, ...).
What I need to achieve is: join the two datasets on their common column (product_id), group by product_id, sum the amount column, and then select/display only specific columns (product_name and the result of the summation).
The following is my attempt:
Dataset<Row> dfSalesTotals = dfSaledetails
.join(dfProducts, dfSaledetails.col("product_id").equalTo(dfProducts.col("product_id")))
.groupBy(dfSaledetails.col("product_id"))
.agg(sum(dfSaledetails.col("amount")).alias("total_amount"))
.select(dfProducts.col("product_name"), col("total_amount"));
dfSalesTotals.show();
which throws the following error:
Caused by: org.apache.spark.sql.AnalysisException: Resolved attribute(s) product_name#215 missing from product_id#272,total_amount#499 in operator !Project [product_name#215, total_amount#499].;;
!Project [product_name#215, total_amount#499]
+- Aggregate [product_id#272], [product_id#272, sum(amount#277) AS total_amount#499]
   +- Join Inner, (product_id#272 = product_id#212)
      :- Relation[sale_detail_auto_id#266,sale_auto_id#267,sale_id#268,agent_id#269,sale_detail_id#270,inventory_id#271,product_id#272,unit_cost#273,unit_price#274,vat#275,quantity#276,amount#277,promotion_id#278,discount#279] csv
      +- Relation[product_id#212,user_group_id_super_owner#213,product_category#214,product_name#215,product_type#216,product_code#217,distributor_code#218,product_units#219,product_unitCost#220,product_manufacturer#221,product_distributor#222,create_date#223,update_date#224,vat#225,product_weight#226,carton_size#227,product_listStatus#228,active_status#229,distributor_type#230,bundle_type#231,barcode_type#232,product_family_id#233] csv
If you want to keep product_name, it should appear either in the groupBy:

.groupBy(
    dfSaledetails.col("product_id"),
    col("product_name"))

or in the agg:

.agg(
    sum(dfSaledetails.col("amount")).alias("total_amount"),
    first(col("product_name")).alias("product_name"))
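To make the fix concrete, here is a minimal self-contained sketch of the corrected pipeline, using the first option (product_name added to the groupBy). The in-memory rows, column values, and the local SparkSession are stand-ins for the original products.csv / saledetails.csv data, which isn't shown in the question:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;
import java.util.Arrays;

public class SalesTotals {

    // Join, aggregate, then project: product_name survives the aggregation
    // because it is part of the grouping key.
    public static Dataset<Row> computeTotals(Dataset<Row> dfProducts,
                                             Dataset<Row> dfSaledetails) {
        return dfSaledetails
            .join(dfProducts, dfSaledetails.col("product_id")
                .equalTo(dfProducts.col("product_id")))
            // include product_name in the grouping so it is available downstream
            .groupBy(dfSaledetails.col("product_id"),
                     dfProducts.col("product_name"))
            .agg(sum(dfSaledetails.col("amount")).alias("total_amount"))
            .select(col("product_name"), col("total_amount"));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("sales-totals")
            .getOrCreate();

        // Tiny in-memory stand-ins for the two CSV files (hypothetical data)
        StructType productSchema = new StructType()
            .add("product_id", DataTypes.IntegerType)
            .add("product_name", DataTypes.StringType);
        StructType saleSchema = new StructType()
            .add("product_id", DataTypes.IntegerType)
            .add("amount", DataTypes.DoubleType);

        Dataset<Row> dfProducts = spark.createDataFrame(Arrays.asList(
            RowFactory.create(1, "Widget"),
            RowFactory.create(2, "Gadget")), productSchema);
        Dataset<Row> dfSaledetails = spark.createDataFrame(Arrays.asList(
            RowFactory.create(1, 10.0),
            RowFactory.create(1, 20.0),
            RowFactory.create(2, 15.0)), saleSchema);

        computeTotals(dfProducts, dfSaledetails).show();
        spark.stop();
    }
}
```

The key difference from the failing attempt is that every column referenced in the final select now flows through the Aggregate node, so the analyzer can resolve it.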