Let's say I have a dataset sample (table 1) as shown below -
Here, one customer can use multiple tokens, and one token can be used by multiple customers. For each token, customer, and record creation date, I want the number of distinct customers who used that token before the creation date.
When I try to run the queries below in Spark SQL, I run into the following problems:
Option 1 (correlated subquery)
SELECT
    t1.token,
    t1.customer_id,
    t1.creation_date,
    (SELECT COUNT(DISTINCT t2.customer_id)
     FROM Table1 t2
     WHERE t1.token = t2.token
       AND t2.creation_date < t1.creation_date) AS cust_cnt
FROM Table1 t1;
Error: Correlated column is not allowed in a non-equality predicate
Option 2 (cross join)
SELECT
    t1.token,
    t1.customer_id,
    t1.creation_date,
    COUNT(DISTINCT t2.customer_id) AS cust_cnt
FROM Table1 t1, Table1 t2
WHERE t1.token = t2.token
  AND t2.creation_date < t1.creation_date
GROUP BY t1.token, t1.customer_id, t1.creation_date;
Problem: The query runs for a very long time because Table 1 has millions of rows.
Is there any workaround (e.g. using a window function) to optimize this query in Spark SQL? Note: window functions do not allow a distinct count.
Count the first time a customer appears:
SELECT t1.token, t1.customer_id, t1.creation_date,
       SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END)
           OVER (PARTITION BY token ORDER BY creation_date) AS cust_cnt
FROM (SELECT t1.*,
             ROW_NUMBER() OVER (PARTITION BY token, customer_id
                                ORDER BY creation_date) AS seqnum
      FROM Table1 t1
     ) t1;
Note: This is also counting the current row. I'm guessing that is acceptable for what you want to do.
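If the current row should not be counted, one possible adjustment is to subtract the row's own "first appearance" flag from the running sum. The sketch below is only an illustration under assumptions: it uses Python's built-in sqlite3 module (SQLite 3.25 or newer, which supports window functions) as a stand-in for Spark SQL so it can be run anywhere, the sample rows and the name count_prior_customers are made up for the example, and it assumes no two records for the same token share a creation_date (ties fall into the same default window frame and would be counted together).

```python
import sqlite3

# Hypothetical sample rows (token, customer_id, creation_date).
# They are invented for illustration and are not from the original post.
SAMPLE_ROWS = [
    ("tok_a", "c1", "2021-01-01"),
    ("tok_a", "c2", "2021-01-02"),
    ("tok_a", "c1", "2021-01-03"),
    ("tok_a", "c3", "2021-01-04"),
    ("tok_b", "c1", "2021-01-01"),
]

# Same idea as the query above: flag each customer's first use of a token
# (seqnum = 1), take a running sum of those flags per token, then subtract
# the current row's own flag so that only earlier rows are counted.
QUERY = """
SELECT token, customer_id, creation_date,
       SUM(CASE WHEN seqnum = 1 THEN 1 ELSE 0 END)
           OVER (PARTITION BY token ORDER BY creation_date)
       - CASE WHEN seqnum = 1 THEN 1 ELSE 0 END AS cust_cnt_before
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY token, customer_id
                                ORDER BY creation_date) AS seqnum
      FROM Table1 t) s
ORDER BY token, creation_date
"""


def count_prior_customers(rows):
    """Load rows into an in-memory table and run the window query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Table1 (token TEXT, customer_id TEXT, creation_date TEXT)"
    )
    conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = count_prior_customers(SAMPLE_ROWS)
for row in result:
    print(row)
```

On this sample, the third tok_a row (c1 on 2021-01-03) gets a count of 2, because c1 and c2 both used the token earlier; that matches what the original COUNT(DISTINCT ...) with t2.creation_date < t1.creation_date would return for that row.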