Querying data in Databricks Spark SQL

Question

I am very new to Databricks SQL and need some help with a task.

I have a table POLICY with the following columns:

PolicyId	CustomerId	AgentId
P123	C123	A123
P124	C124	A124
P125	C123	A125
P126	C124	A124

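I need to identify policies that belong to the same customer but whose agent differs from the agent on that customer's other policies.

In this example, policies P124 and P126 under customer C124 are fine, because they have the same agent, A124. Policies P123 and P125 of customer C123, however, should be flagged as having different agents.

Basically, a customer may hold more than one policy as long as all of them are under the same agent. Any policies that belong to the same customer but have different servicing agents should be flagged.

How can I achieve this in Databricks SQL? So far I have only identified the policies that belong to the same customer, using the collect_list() aggregate function.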

select customerid,collect_list(distinct policyid) from Policy group by customerid

customerId	collect_list(DISTINCT policyId)
C123	["P123","P125"]
C124	["P124","P126"]

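Edit: I tried the solution below and it works. However, there is a small change in the requirement. Using the dataset from the query below, I now need to identify policies under the same customer that have different agents, where those agents belong to the same group.

I have a lookup table of agents with their group codes:

AgentId	GroupCode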
A123	1
A124	2
A125	2
A126	2

Answer 1

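Since you want to get (flag) each customerId that has policies under different agents, you can write the query using just customerId and AgentId. I used the same data that was given as an example.

The following query returns the number of distinct AgentId values for each customerId. A customer needs to be flagged when that count is greater than 1.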

select customerId, count(distinct(AgentId)) from policy group by customerId

Output

+----------+-----------------------+
|customerId|count(DISTINCT AgentId)|
+----------+-----------------------+
|      C123|                      2|
|      C124|                      1|
+----------+-----------------------+

Now, since we want to flag the customers that have more than one agent, you can use the following query to get the complete details.

select customerId,collect_list(distinct AgentId),count(policyId),collect_list(distinct policyId) from policy group by customerId having count(distinct(agentId))>1

Output

+----------+------------------------------+---------------+-------------------------------+
|customerId|collect_list(DISTINCT AgentId)|count(policyId)|collect_list(DISTINCT policyId)|
+----------+------------------------------+---------------+-------------------------------+
|      C123|                  [A123, A125]|              2|                   [P123, P125]|
+----------+------------------------------+---------------+-------------------------------+
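
To try the GROUP BY / HAVING logic without a cluster, here is a minimal sketch that runs the same filter in SQLite through Python's standard sqlite3 module. This is only a stand-in for Databricks: SQLite has no collect_list, so the sketch returns just the distinct agent count, and the names flagged_customers, FLAG_QUERY and agent_count are my own choices for illustration.

```python
import sqlite3

# Sample rows from the question: (policyId, customerId, agentId).
POLICIES = [
    ("P123", "C123", "A123"),
    ("P124", "C124", "A124"),
    ("P125", "C123", "A125"),
    ("P126", "C124", "A124"),
]

# Same GROUP BY / HAVING shape as the query above, without collect_list,
# which SQLite does not provide.
FLAG_QUERY = """
    SELECT customerId, COUNT(DISTINCT agentId) AS agent_count
    FROM policy
    GROUP BY customerId
    HAVING COUNT(DISTINCT agentId) > 1
    ORDER BY customerId
"""


def flagged_customers(rows):
    """Return (customerId, agent_count) for customers with more than one agent."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE policy (policyId TEXT, customerId TEXT, agentId TEXT)"
        )
        conn.executemany("INSERT INTO policy VALUES (?, ?, ?)", rows)
        return conn.execute(FLAG_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(flagged_customers(POLICIES))
```

On the sample data only C123 comes back, which matches the Spark output shown above.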

Update:

You can create a view with the following query, then query that view to get the results you need.

create view flagged_customers as select customerId,collect_list(distinct AgentId) as distinct_agent_list,count(policyId) as policy_count,collect_list(distinct policyId) as distinct_policy_list from policy group by customerId having count(distinct(agentId))>1

--select * from flagged_customers

Output:

+----------+-------------------+------------+--------------------+
|customerId|distinct_agent_list|policy_count|distinct_policy_list|
+----------+-------------------+------------+--------------------+
|      C123|       [A123, A125]|           2|        [P123, P125]|
+----------+-------------------+------------+--------------------+

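If you want the full row information without arrays, so that it is easier to query, you can use an inner join between the policy table and the flagged_customers view, as shown below: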

select t1.* from policy as t1 inner join flagged_customers as t2 on t1.customerid=t2.customerid

Output:

+--------+----------+-------+
|PolicyId|CustomerId|AgentId|
+--------+----------+-------+
|    P123|      C123|   A123|
|    P125|      C123|   A125|
+--------+----------+-------+
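
The edited requirement in the question (different agents that belong to the same group) is not covered above, so here is a hedged sketch of one possible reading: flag the policies of a customer who has more than one distinct agent while all of those agents share a single group code. The table name agent_group, the column name groupCode, the function name and the extra policy P127 are assumptions made for illustration; they do not come from the question. It runs in SQLite through Python's sqlite3 module so it can be tried locally, but the query uses only joins, an IN subquery, GROUP BY and HAVING, which Spark SQL also supports.

```python
import sqlite3

# Sample rows from the question: (policyId, customerId, agentId).
POLICIES = [
    ("P123", "C123", "A123"),
    ("P124", "C124", "A124"),
    ("P125", "C123", "A125"),
    ("P126", "C124", "A124"),
]

# Lookup rows from the question's edit: (agentId, groupCode).
AGENT_GROUPS = [("A123", 1), ("A124", 2), ("A125", 2), ("A126", 2)]

# Hypothetical extra policy (not in the question) so that one customer matches.
HYPOTHETICAL_ROW = ("P127", "C124", "A126")

# Assumed reading: more than one distinct agent, exactly one distinct group.
# The inner joins drop policies whose agent is missing from the lookup table.
SAME_GROUP_QUERY = """
    SELECT p.policyId, p.customerId, p.agentId, g.groupCode
    FROM policy AS p
    JOIN agent_group AS g ON g.agentId = p.agentId
    WHERE p.customerId IN (
        SELECT p2.customerId
        FROM policy AS p2
        JOIN agent_group AS g2 ON g2.agentId = p2.agentId
        GROUP BY p2.customerId
        HAVING COUNT(DISTINCT p2.agentId) > 1
           AND COUNT(DISTINCT g2.groupCode) = 1
    )
    ORDER BY p.policyId
"""


def policies_in_same_group(policies, agent_groups):
    """Return policy rows of customers whose differing agents share one group."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE policy (policyId TEXT, customerId TEXT, agentId TEXT)"
        )
        conn.execute("CREATE TABLE agent_group (agentId TEXT, groupCode INTEGER)")
        conn.executemany("INSERT INTO policy VALUES (?, ?, ?)", policies)
        conn.executemany("INSERT INTO agent_group VALUES (?, ?)", agent_groups)
        return conn.execute(SAME_GROUP_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    # Question data only: C123's agents are in groups 1 and 2, C124 has one agent.
    print(policies_in_same_group(POLICIES, AGENT_GROUPS))
    # With the hypothetical P127, C124 has agents A124 and A126, both in group 2.
    print(policies_in_same_group(POLICIES + [HYPOTHETICAL_ROW], AGENT_GROUPS))
```

With only the question's data nothing is flagged under this reading, because C123's two agents sit in different groups. If the intended rule is the opposite (agents in different groups), changing the second HAVING condition to COUNT(DISTINCT g2.groupCode) > 1 would express that instead.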
