Spark SQL Query to Compare two Spark Dataframes using Spark Java
SQL Query to JAVA in Spark
For practice, I am trying to convert a SQL query into a Spark program using Java. Below are the schemas of the two files I am working with, followed by the query I am trying to convert.
Schema of each file — schema of store_returns:
root
|-- datetime: long (nullable = true)
|-- sr_returned_date_sk: long (nullable = true)
|-- sr_return_time_sk: long (nullable = true)
|-- sr_item_sk: long (nullable = true)
|-- sr_customer_sk: long (nullable = true)
|-- sr_cdemo_sk: long (nullable = true)
|-- sr_hdemo_sk: long (nullable = true)
|-- sr_addr_sk: long (nullable = true)
|-- sr_store_sk: long (nullable = true)
|-- sr_reason_sk: long (nullable = true)
|-- sr_ticket_number: long (nullable = true)
|-- sr_return_quantity: integer (nullable = true)
|-- sr_return_amt: double (nullable = true)
|-- sr_return_tax: double (nullable = true)
|-- sr_return_amt_inc_tax: double (nullable = true)
|-- sr_fee: double (nullable = true)
|-- sr_return_ship_cost: double (nullable = true)
|-- sr_refunded_cash: double (nullable = true)
|-- sr_reversed_charge: double (nullable = true)
|-- sr_store_credit: double (nullable = true)
|-- sr_net_loss: double (nullable = true)
Schema of date_dim:
root
|-- d_date_sk: long (nullable = true)
|-- d_date_id: string (nullable = true)
|-- d_date: string (nullable = true)
|-- d_month_seq: integer (nullable = true)
|-- d_week_seq: integer (nullable = true)
|-- d_quarter_seq: integer (nullable = true)
|-- d_year: integer (nullable = true)
|-- d_dow: integer (nullable = true)
|-- d_moy: integer (nullable = true)
|-- d_dom: integer (nullable = true)
|-- d_qoy: integer (nullable = true)
|-- d_fy_year: integer (nullable = true)
|-- d_fy_quarter_seq: integer (nullable = true)
|-- d_fy_week_seq: integer (nullable = true)
|-- d_day_name: string (nullable = true)
|-- d_quarter_name: string (nullable = true)
|-- d_holiday: string (nullable = true)
|-- d_weekend: string (nullable = true)
|-- d_following_holiday: string (nullable = true)
|-- d_first_dom: integer (nullable = true)
|-- d_last_dom: integer (nullable = true)
|-- d_same_day_ly: integer (nullable = true)
|-- d_same_day_lq: integer (nullable = true)
|-- d_current_day: string (nullable = true)
|-- d_current_week: string (nullable = true)
|-- d_current_month: string (nullable = true)
|-- d_current_quarter: string (nullable = true)
|-- d_current_year: string (nullable = true)
The query is:
select sr_customer_sk as ctr_customer_sk
,sr_store_sk as ctr_store_sk
,sum(sr_return_quantity) as ctr_total_return
from store_returns
,date_dim
where sr_returned_date_sk = d_date_sk
and d_year = 2003
group by sr_customer_sk
,sr_store_sk
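To pin down what the query computes before translating it to the DataFrame API, here is a small in-memory sketch of the same aggregation in plain Java streams. The row classes and sample values are invented for illustration and are not part of the actual TPC-DS data.

```java
import java.util.*;
import java.util.stream.*;

public class ReturnAggSketch {
    // Minimal stand-ins for the two tables (hypothetical sample rows).
    record StoreReturn(long returnedDateSk, long customerSk, long storeSk, int returnQuantity) {}
    record DateDim(long dateSk, int year) {}

    // Same logic as the SQL: join on the date key, keep only year 2003,
    // then sum sr_return_quantity per (customer, store) pair.
    static Map<List<Long>, Integer> totalReturns(List<StoreReturn> returns, List<DateDim> dates) {
        Set<Long> dates2003 = dates.stream()
                .filter(d -> d.year() == 2003)
                .map(DateDim::dateSk)
                .collect(Collectors.toSet());
        return returns.stream()
                .filter(r -> dates2003.contains(r.returnedDateSk()))
                .collect(Collectors.groupingBy(
                        r -> List.of(r.customerSk(), r.storeSk()),
                        Collectors.summingInt(StoreReturn::returnQuantity)));
    }

    public static void main(String[] args) {
        List<DateDim> dates = List.of(new DateDim(1L, 2003), new DateDim(2L, 2004));
        List<StoreReturn> returns = List.of(
                new StoreReturn(1L, 10L, 100L, 3),
                new StoreReturn(1L, 10L, 100L, 2),
                new StoreReturn(2L, 10L, 100L, 7)); // dropped by the filter: its date is in 2004
        System.out.println(totalReturns(returns, dates).get(List.of(10L, 100L))); // 5
    }
}
```

Note that the year filter belongs to the join/filter stage, before the grouping — the same ordering the DataFrame version has to follow.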
So far I have written the following:
Dataset<Row> df = store_returns
    .join(date_dim, store_returns.col("sr_returned_date_sk").equalTo(date_dim.col("d_date_sk")));
df.groupBy("sr_customer_sk", "sr_store_sk").agg(sum("sr_return_quantity").alias("ctr_total_return"))
    .select(col("sr_returned_date_sk").alias("ctr_customer_sk"),
            col("sr_store_sk").alias("ctr_store_sk"))
    .where(col("d_year").equalTo("2003").and(col("sr_returned_date_sk").equalTo(col("d_date_sk"))))
    .groupBy("sr_customer_sk", "sr_store_sk").agg(sum("sr_return_quantity").alias("ctr_total_return")).show();
I get the following error:
18/04/23 14:31:40 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'sr_returned_date_sk' given input columns: [sr_customer_sk, sr_store_sk, ctr_total_return];
'Project ['sr_returned_date_sk AS ctr_customer_sk#309, sr_store_sk#8L AS ctr_store_sk#310L]
+- Aggregate [sr_customer_sk#4L, sr_store_sk#8L], [sr_customer_sk#4L, sr_store_sk#8L, sum(cast(sr_return_quantity#11 as bigint)) AS ctr_total_return#304L]
   +- Join Inner, (sr_returned_date_sk#1L = d_date_sk#43L)
      :- Relation[datetime#0L, sr_returned_date_sk#1L, sr_return_time_sk#2L, sr_item_sk#3L, sr_customer_sk#4L, sr_cdemo_sk#5L, sr_hdemo_sk#6L, sr_addr_sk#7L, sr_store_sk#8L, sr_reason_sk#9L, sr_ticket_number#10L, sr_return_quantity#11, sr_return_amt#12, sr_return_tax#13, sr_return_amt_inc_tax#14, sr_fee#15, sr_return_ship_cost#16, sr_refunded_cash#17, sr_reversed_charge#18, sr_store_credit#19, sr_net_loss#20] parquet
      +- Relation[d_date_sk#43L, d_date_id#44, d_date#45, d_month_seq#46, d_week_seq#47, d_quarter_seq#48, d_year#49, d_dow#50, d_moy#51, d_dom#52, d_qoy#53, d_fy_year#54, d_fy_quarter_seq#55, d_fy_week_seq#56, d_day_name#57, d_quarter_name#58, d_holiday#59, d_weekend#60, d_following_holiday#61, d_first_dom#62, d_last_dom#63, d_same_day_ly#64, d_same_day_lq#65, d_current_day#66, ... 4 more fields] parquet
df.groupBy("sr_customer_sk","sr_store_sk").agg(sum("sr_return_quantity").alias("ctr_total_return"))
This produces a DataFrame with exactly three columns: sr_customer_sk, sr_store_sk and ctr_total_return. Since sr_returned_date_sk is not one of them, the subsequent select(col("sr_returned_date_sk")...) cannot be resolved.
Try using:
Dataset<Row> df = store_returns
    .join(date_dim, store_returns.col("sr_returned_date_sk").equalTo(date_dim.col("d_date_sk")))
    .where(col("d_year").equalTo("2003"));
df.groupBy("sr_customer_sk", "sr_store_sk").agg(sum("sr_return_quantity").alias("ctr_total_return"))
    .select(col("sr_customer_sk").alias("ctr_customer_sk"),
            col("sr_store_sk").alias("ctr_store_sk"), col("ctr_total_return"));
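Since this is a translation exercise, it can also help to keep the original SQL around, run it via spark.sql, and compare the two results with except(). The SparkSession calls below are commented out because they need a running Spark environment and the tables registered as temp views; only the query string itself is plain Java. The class name QueryText is just a placeholder.

```java
public class QueryText {
    // The original SQL, rewritten with an explicit JOIN (same semantics as the comma join).
    static final String QUERY =
        "select sr_customer_sk as ctr_customer_sk, " +
        "       sr_store_sk as ctr_store_sk, " +
        "       sum(sr_return_quantity) as ctr_total_return " +
        "from store_returns " +
        "join date_dim on sr_returned_date_sk = d_date_sk " +
        "where d_year = 2003 " +
        "group by sr_customer_sk, sr_store_sk";

    public static void main(String[] args) {
        // With a SparkSession `spark` and store_returns/date_dim registered as temp views:
        // Dataset<Row> expected = spark.sql(QUERY);
        // expected.except(dataframeVersion).show(); // empty output means the two results agree
        System.out.println(QUERY);
    }
}
```

except() returns the rows present in one Dataset but not the other, so an empty result in both directions confirms the DataFrame translation matches the SQL.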
Disclaimer: the technical posts on this site are licensed under CC BY-SA 4.0; if you need to reprint, please credit this site or the original source. For any questions contact: yoyou2525@163.com.