
Scala: Spark sqlContext query

I only have 3 event types (3rd column) in my file: 01, 02, and 03.

The schema is unixTimestamp|id|eventType|date1|date2|date3:

639393604950|1001|01|2015-05-12 10:00:18|||
639393604950|1002|01|2015-05-12 10:04:18|||
639393604950|1003|01|2015-05-12 10:05:18|||
639393604950|1001|02||2015-05-12 10:40:18||
639393604950|1001|03|||2015-05-12 19:30:18|
639393604950|1002|02|2015-05-12 10:04:18|||
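
A minimal sketch of loading this sample into sqlContext so the queries below can be run against it. The path events.psv and the table name table1 are assumptions, and it presumes a Spark 1.x shell where sc is already defined (the question itself queries a Parquet file):

import org.apache.spark.sql.SQLContext

case class Event(ts: String, id: String, eventType: String,
                 date1: String, date2: String, date3: String)

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val events = sc.textFile("events.psv")   // hypothetical path
  .map(_.split("\\|", -1))               // limit -1 keeps trailing empty fields
  .map(f => Event(f(0), f(1), f(2), f(3), f(4), f(5)))
  .toDF()

events.registerTempTable("table1")       // table name used in the answer below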

In sqlContext, how do I merge the data by id? I am expecting this for id 1001:

639393604950|1001|01|2015-05-12 10:00:18|2015-05-12 10:40:18|2015-05-12 19:30:18|

Here's my query that needs to be adjusted:

val events = sqlContext.sql("SELECT id, max(date1), max(date2), max(date3) " +
  "FROM parquetFile group by id, date1, date2, date3")
events.collect().foreach(println)
One fix is to group by id alone, instead of by id and the date columns, so that max() collapses each id's rows into a single line:

SELECT id, max(date1), max(date2), max(date3) FROM parquetFile GROUP BY id
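
Run from Scala, this is just the original call with the adjusted GROUP BY (a sketch, otherwise unchanged):

val merged = sqlContext.sql(
  "SELECT id, max(date1), max(date2), max(date3) " +
  "FROM parquetFile GROUP BY id")
merged.collect().foreach(println)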

Given the way the data is generated, the schema in the file is confusing: all dates are populated in the date1 field, distinguished only by event type. Hence, we need to fix that in the query.

select id, ts, max(d1), max(d2), max(d3)
  from (select id, ts,
               case when eventtype = '01' then date1 else null end d1,
               case when eventtype = '02' then date1 else null end d2,
               case when eventtype = '03' then date1 else null end d3
          from table1
       ) x
 group by id, ts

Of course, this groups by id and ts together, as expected in the answer.
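
The same conditional aggregation can be written with the DataFrame API instead of SQL. This is a sketch assuming the events DataFrame from the loading example above; when() without otherwise() yields null for non-matching rows, exactly like the CASE ... ELSE NULL above:

import org.apache.spark.sql.functions.{when, max}

val fixed = events
  .withColumn("d1", when($"eventType" === "01", $"date1"))  // null otherwise
  .withColumn("d2", when($"eventType" === "02", $"date1"))
  .withColumn("d3", when($"eventType" === "03", $"date1"))
  .groupBy($"id", $"ts")
  .agg(max($"d1"), max($"d2"), max($"d3"))

fixed.show()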
