SQL Query to JAVA in Spark

I am trying to convert a SQL query into a Spark program in Java for practice. Below are the schemas of the two files I am using, followed by the query I am trying to convert.

Schema of each file. store_returns' schema:

root
 |-- datetime: long (nullable = true)
 |-- sr_returned_date_sk: long (nullable = true)
 |-- sr_return_time_sk: long (nullable = true)
 |-- sr_item_sk: long (nullable = true)
 |-- sr_customer_sk: long (nullable = true)
 |-- sr_cdemo_sk: long (nullable = true)
 |-- sr_hdemo_sk: long (nullable = true)
 |-- sr_addr_sk: long (nullable = true)
 |-- sr_store_sk: long (nullable = true)
 |-- sr_reason_sk: long (nullable = true)
 |-- sr_ticket_number: long (nullable = true)
 |-- sr_return_quantity: integer (nullable = true)
 |-- sr_return_amt: double (nullable = true)
 |-- sr_return_tax: double (nullable = true)
 |-- sr_return_amt_inc_tax: double (nullable = true)
 |-- sr_fee: double (nullable = true)
 |-- sr_return_ship_cost: double (nullable = true)
 |-- sr_refunded_cash: double (nullable = true)
 |-- sr_reversed_charge: double (nullable = true)
 |-- sr_store_credit: double (nullable = true)
 |-- sr_net_loss: double (nullable = true)

date_dim's schema:

root
 |-- d_date_sk: long (nullable = true)
 |-- d_date_id: string (nullable = true)
 |-- d_date: string (nullable = true)
 |-- d_month_seq: integer (nullable = true)
 |-- d_week_seq: integer (nullable = true)
 |-- d_quarter_seq: integer (nullable = true)
 |-- d_year: integer (nullable = true)
 |-- d_dow: integer (nullable = true)
 |-- d_moy: integer (nullable = true)
 |-- d_dom: integer (nullable = true)
 |-- d_qoy: integer (nullable = true)
 |-- d_fy_year: integer (nullable = true)
 |-- d_fy_quarter_seq: integer (nullable = true)
 |-- d_fy_week_seq: integer (nullable = true)
 |-- d_day_name: string (nullable = true)
 |-- d_quarter_name: string (nullable = true)
 |-- d_holiday: string (nullable = true)
 |-- d_weekend: string (nullable = true)
 |-- d_following_holiday: string (nullable = true)
 |-- d_first_dom: integer (nullable = true)
 |-- d_last_dom: integer (nullable = true)
 |-- d_same_day_ly: integer (nullable = true)
 |-- d_same_day_lq: integer (nullable = true)
 |-- d_current_day: string (nullable = true)
 |-- d_current_week: string (nullable = true)
 |-- d_current_month: string (nullable = true)
 |-- d_current_quarter: string (nullable = true)
 |-- d_current_year: string (nullable = true)

The query is:

select sr_customer_sk as ctr_customer_sk
      ,sr_store_sk as ctr_store_sk
      ,sum(sr_return_quantity) as ctr_total_return
      from store_returns
      ,date_dim
      where sr_returned_date_sk = d_date_sk
      and d_year = 2003
      group by sr_customer_sk
      ,sr_store_sk

For this, I have written the following so far:

Dataset<Row> df = store_returns
        .join(date_dim, store_returns.col("sr_returned_date_sk").equalTo(date_dim.col("d_date_sk")));

df.groupBy("sr_customer_sk", "sr_store_sk").agg(sum("sr_return_quantity").alias("ctr_total_return"))
        .select(col("sr_returned_date_sk").alias("ctr_customer_sk"),
                col("sr_store_sk").alias("ctr_store_sk"))
        .where(col("d_year").equalTo("2003").and(col("sr_returned_date_sk").equalTo(col("d_date_sk"))))
        .groupBy("sr_customer_sk", "sr_store_sk").agg(sum("sr_return_quantity").alias("ctr_total_return")).show();

I am getting the following error with it:

18/04/23 14:31:40 WARN Utils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.debug.maxToStringFields' in SparkEnv.conf.
Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'sr_returned_date_sk' given input columns: [sr_customer_sk, sr_store_sk, ctr_total_return];;
'Project ['sr_returned_date_sk AS ctr_customer_sk#309, sr_store_sk#8L AS ctr_store_sk#310L]
+- Aggregate [sr_customer_sk#4L, sr_store_sk#8L], [sr_customer_sk#4L, sr_store_sk#8L, sum(cast(sr_return_quantity#11 as bigint)) AS ctr_total_return#304L]
   +- Join Inner, (sr_returned_date_sk#1L = d_date_sk#43L)
      :- Relation[datetime#0L,sr_returned_date_sk#1L,sr_return_time_sk#2L,sr_item_sk#3L,sr_customer_sk#4L,sr_cdemo_sk#5L,sr_hdemo_sk#6L,sr_addr_sk#7L,sr_store_sk#8L,sr_reason_sk#9L,sr_ticket_number#10L,sr_return_quantity#11,sr_return_amt#12,sr_return_tax#13,sr_return_amt_inc_tax#14,sr_fee#15,sr_return_ship_cost#16,sr_refunded_cash#17,sr_reversed_charge#18,sr_store_credit#19,sr_net_loss#20] parquet
      +- Relation[d_date_sk#43L,d_date_id#44,d_date#45,d_month_seq#46,d_week_seq#47,d_quarter_seq#48,d_year#49,d_dow#50,d_moy#51,d_dom#52,d_qoy#53,d_fy_year#54,d_fy_quarter_seq#55,d_fy_week_seq#56,d_day_name#57,d_quarter_name#58,d_holiday#59,d_weekend#60,d_following_holiday#61,d_first_dom#62,d_last_dom#63,d_same_day_ly#64,d_same_day_lq#65,d_current_day#66,... 4 more fields] parquet

df.groupBy("sr_customer_sk","sr_store_sk").agg(sum("sr_return_quantity").alias("ctr_total_return"))

This results in a dataframe with only three columns, sr_customer_sk, sr_store_sk and ctr_total_return, so the subsequent select(col("sr_returned_date_sk")...) cannot work: the aggregated dataframe no longer has a sr_returned_date_sk column.
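You can see this by printing the schema right after the aggregation. A quick check (a sketch, assuming df is the joined Dataset from the question; the exact nullability in the output may differ):

// Only the grouping keys and the aggregate alias survive the groupBy/agg,
// so sr_returned_date_sk is no longer available for select().
df.groupBy("sr_customer_sk", "sr_store_sk")
        .agg(sum("sr_return_quantity").alias("ctr_total_return"))
        .printSchema();
// root
//  |-- sr_customer_sk: long (nullable = true)
//  |-- sr_store_sk: long (nullable = true)
//  |-- ctr_total_return: long (nullable = true)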

Try using:

Dataset<Row> df = store_returns
        .join(date_dim, store_returns.col("sr_returned_date_sk").equalTo(date_dim.col("d_date_sk")))
        .where(col("d_year").equalTo("2003"));

df.groupBy("sr_customer_sk", "sr_store_sk").agg(sum("sr_return_quantity").alias("ctr_total_return"))
        .select(col("sr_customer_sk").alias("ctr_customer_sk"),
                col("sr_store_sk").alias("ctr_store_sk"), col("ctr_total_return"));
