
Using COUNT and GROUP BY in Spark SQL

I'm trying to get pretty basic output that pulls unique NDC codes for medications and counts the number of unique patients that take each drug. My dataset basically looks like this:

patient_id | drug_ndc
---------------------
01         | 250
02         | 725       
03         | 1075
04         | 1075
05         | 250
06         | 250

I want the output to look something like this:

NDC  | Patients
--------------
250  |  3
1075 |  2
725  |  1
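
For reference, sample data like this can be registered as a temporary view in Spark SQL so the queries below can be tried out. This is only a minimal sketch; the view name meds and the column types are assumptions, not taken from the question:

-- Register the sample rows as a temporary view (the name "meds" is hypothetical)
CREATE OR REPLACE TEMPORARY VIEW meds AS
SELECT * FROM VALUES
  ('01', 250),
  ('02', 725),
  ('03', 1075),
  ('04', 1075),
  ('05', 250),
  ('06', 250)
AS t(patient_id, drug_ndc);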

I tried using some queries like this:

select distinct drug_ndc as NDC, count patient_id as Patients
from table 1
group by 1
order by 1

But I keep getting errors. I've tried with and without using an alias, but to no avail.

The correct syntax should be:

select drug_ndc as NDC, count(*) as Patients
from table 1
group by drug_ndc
order by 1;

SELECT DISTINCT is almost never appropriate together with GROUP BY. And you can use COUNT(*) unless the patient id can be NULL.
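
The NULL caveat matters because COUNT(*) counts every row, while COUNT(patient_id) skips rows where patient_id is NULL. A quick illustration with made-up inline data:

select count(*) as all_rows, count(patient_id) as non_null_ids
from values (cast(null as string), 250), ('01', 250) as t(patient_id, drug_ndc);
-- all_rows = 2, non_null_ids = 1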

To get the number of unique patients, you should do:

select drug_ndc as NDC, count(distinct patient_id) as Patients
from table 1
group by drug_ndc;
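
Run against the hypothetical meds view sketched above, this produces one row per NDC with the number of distinct patients, matching the expected output:

select drug_ndc as NDC, count(distinct patient_id) as Patients
from meds
group by drug_ndc
order by Patients desc;
-- 250  | 3
-- 1075 | 2
-- 725  | 1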
