
Not able to COUNT DISTINCT using WINDOW functions (Spark SQL)

Let's say I have a sample dataset (Table 1) as shown below:

Table 1

Here, one customer can use multiple tokens, and one token can be used by multiple customers. For each record (token, customer, and creation date), I am trying to get the number of distinct customers who used that token before the record's creation date.

When I try to run this in Spark SQL, I run into the following problems:

Option 1 (correlated subquery)

SELECT
    t1.token,
    t1.customer_id,
    t1.creation_date,
    (SELECT COUNT(DISTINCT t2.customer_id)
     FROM Table1 t2
     WHERE t1.token = t2.token
       AND t2.creation_date < t1.creation_date) cust_cnt
FROM Table1 t1;

Error: Correlated column is not allowed in a non-equality predicate

Option 2 (cross join)

SELECT
    t1.token,
    t1.customer_id,
    t1.creation_date,
    COUNT(DISTINCT t2.customer_id) AS cust_cnt
FROM Table1 t1, Table1 t2
WHERE t1.token = t2.token
  AND t2.creation_date < t1.creation_date
GROUP BY t1.token, t1.customer_id, t1.creation_date;

Problem: The query runs for a long time because Table 1 has millions of rows.

Is there any workaround (e.g. using a window function) to optimize this query in Spark SQL? Note: window functions do not allow a distinct count.

Flag the first time each customer appears for a token, then take a running sum of those flags:

SELECT t1.token, t1.customer_id, t1.creation_date,
       SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END) OVER (PARTITION BY token ORDER BY creation_date) as cust_cnt
FROM (SELECT t1.*,
             ROW_NUMBER() OVER (PARTITION BY token, customer_id ORDER BY creation_date) as seqnum
      FROM Table1  t1
     ) t1;

Note: This also counts the current row. I'm guessing that is acceptable for what you want to do.
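If the count should only cover customers who used the token strictly before the row's creation date, as in the two original queries, one possible adjustment is to subtract the first appearances that fall on the row's own date. The sketch below is an assumption of mine and has not been run on Spark: it uses Python's sqlite3 module as a stand-in engine, and the sample rows, the `is_first` column, and the helper names `window_counts` and `brute_force` are hypothetical. It compares the window query against a direct (quadratic) count on a tiny sample.

```python
import sqlite3

# Hypothetical sample rows (token, customer_id, creation_date); not from the question.
ROWS = [
    ("tokA", "c1", "2020-01-01"),
    ("tokA", "c2", "2020-01-02"),
    ("tokA", "c1", "2020-01-03"),
    ("tokA", "c3", "2020-01-03"),
    ("tokB", "c1", "2020-01-01"),
    ("tokB", "c1", "2020-01-05"),
]

QUERY = """
SELECT token, customer_id, creation_date,
       -- first appearances up to and including this row's date ...
       SUM(is_first) OVER (PARTITION BY token ORDER BY creation_date)
       -- ... minus the first appearances that fall on this row's own date
     - SUM(is_first) OVER (PARTITION BY token, creation_date) AS cust_cnt
FROM (SELECT t.*,
             CASE WHEN ROW_NUMBER() OVER (PARTITION BY token, customer_id
                                          ORDER BY creation_date) = 1
                  THEN 1 ELSE 0 END AS is_first
      FROM Table1 t) AS f
ORDER BY token, creation_date, customer_id
"""


def window_counts(rows):
    """Run the window-function query on an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Table1 (token TEXT, customer_id TEXT, creation_date TEXT)"
    )
    conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


def brute_force(rows):
    """Directly count distinct customers with an earlier date for the same token."""
    out = []
    for token, customer, date in rows:
        # ISO-formatted date strings compare correctly as plain strings.
        earlier = {c for t, c, d in rows if t == token and d < date}
        out.append((token, customer, date, len(earlier)))
    return sorted(out, key=lambda r: (r[0], r[2], r[1]))


if __name__ == "__main__":
    for row in window_counts(ROWS):
        print(row)
```

The idea is that a customer used the token before a given date exactly when that customer's first use is before that date, so a running sum of first-appearance flags, minus the flags on the same date, gives the strict count. Both window expressions use only standard window syntax, but it is worth checking the result on a small sample in Spark before relying on it.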
