
Scala: Spark sqlContext query

I only have 3 event types (3rd column) in my file: 01, 02, and 03.

The schema is unixTimestamp|id|eventType|date1|date2|date3:

639393604950|1001|01|2015-05-12 10:00:18|||
639393604950|1002|01|2015-05-12 10:04:18|||
639393604950|1003|01|2015-05-12 10:05:18|||
639393604950|1001|02||2015-05-12 10:40:18||
639393604950|1001|03|||2015-05-12 19:30:18|
639393604950|1002|02|2015-05-12 10:04:18|||
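
A minimal sketch of loading this sample into sqlContext so the queries below can be run against it. The path events.psv and the table name table1 are assumptions, and it presumes a Spark 1.x shell where sc is already defined (the question itself queries a Parquet file):

import org.apache.spark.sql.SQLContext

case class Event(ts: String, id: String, eventType: String,
                 date1: String, date2: String, date3: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val events = sc.textFile("events.psv")   // hypothetical path
  .map(_.split("\\|", -1))               // limit -1 keeps trailing empty fields
  .map(f => Event(f(0), f(1), f(2), f(3), f(4), f(5)))
  .toDF()

events.registerTempTable("table1")       // table name used in the answer below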

In sqlContext, how do I merge the data by id? I am expecting this for id 1001:

639393604950|1001|01|2015-05-12 10:00:18|2015-05-12 10:40:18|2015-05-12 19:30:18|

Here's my query that needs to be adjusted:

val events = sqlContext.sql("SELECT id, max(date1), max(date2), max(date3) " +
  "FROM parquetFile group by id, date1, date2, date3")
events.collect().foreach(println)
One fix is to group by id alone, instead of by id and the date columns, so that max() collapses each id's rows into a single line:

SELECT id, max(date1), max(date2), max(date3) FROM parquetFile GROUP BY id
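
Run from Scala, this is just the original call with the adjusted GROUP BY (a sketch, otherwise unchanged):

val merged = sqlContext.sql(
  "SELECT id, max(date1), max(date2), max(date3) " +
  "FROM parquetFile GROUP BY id")
merged.collect().foreach(println)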

Given the way the data is generated, the schema in the file is confusing: all dates are populated in the date1 field, distinguished only by event type. Hence, we need to fix that in the query.

select id, ts, max(d1), max(d2), max(d3)
  from (select id, ts,
               case when eventtype = '01' then date1 else null end d1,
               case when eventtype = '02' then date1 else null end d2,
               case when eventtype = '03' then date1 else null end d3
          from table1
       ) x
 group by id, ts

Of course, this groups by id and ts together, as expected in the answer.
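
The same conditional aggregation can be written with the DataFrame API instead of SQL. This is a sketch assuming the events DataFrame from the loading example above; when() without otherwise() yields null for non-matching rows, exactly like the CASE ... ELSE NULL above:

import org.apache.spark.sql.functions.{when, max}

val fixed = events
  .withColumn("d1", when($"eventType" === "01", $"date1"))  // null otherwise
  .withColumn("d2", when($"eventType" === "02", $"date1"))
  .withColumn("d3", when($"eventType" === "03", $"date1"))
  .groupBy($"id", $"ts")
  .agg(max($"d1"), max($"d2"), max($"d3"))

fixed.show()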
