Difference between nodupkey in SAS and SELECT DISTINCT * FROM table_name in SQL
I have a data set with 2 fields storing strings.
1. In SAS, when I do a nodupkey on the data set, I get ~200 records.
2. In SQL, when I do a SELECT DISTINCT / GROUP BY / PARTITION BY, I get ~2000 records.
This SQL code is run on Hive, which is hosted on an AWS EMR server.
The data set I am working on has NULL in some of the records for one of the fields. I am not doing anything else apart from what I mentioned in points 1 and 2.
I am looking for an explanation as to why there is such a huge mismatch between these two when I am doing just a simple duplicate removal.
DISTINCT operates on all fields in the SELECT statement, and the database will likely treat NULLs and blanks as different values. SAS does not treat nulls and blanks as different, and NODUPKEY only filters on the variables listed in the BY statement.
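To make the two effects concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module as a stand-in for Hive, and a small helper, nodupkey_like, that only approximates key-based deduplication as described above. The table, column names, sample rows, and helper are all hypothetical, and the helper is not SAS's actual implementation (PROC SORT also sorts the data first, which this sketch skips).

```python
import sqlite3

# Hypothetical sample rows: (key_col, other_col). None plays the role of SQL NULL.
rows = [
    ("a", "x"),
    ("a", "y"),   # same key, different second field
    ("a", "x"),   # exact duplicate of the first row
    ("b", None),  # NULL in the second field
    ("b", ""),    # blank in the second field
    (None, "z"),  # NULL key
    ("", "z"),    # blank key
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (key_col TEXT, other_col TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

# SELECT DISTINCT compares every selected column, and NULL is not equal to ''.
distinct_rows = conn.execute(
    "SELECT DISTINCT key_col, other_col FROM t"
).fetchall()


def nodupkey_like(records, key_index=0):
    """Keep the first record per key, treating NULL and blank as the same key.

    Assumption: this mimics deduplicating on a single BY variable where a
    missing character value is just a blank.
    """
    seen = set()
    kept = []
    for rec in records:
        key = (rec[key_index] or "").rstrip()  # None and '' both become ''
        if key not in seen:
            seen.add(key)
            kept.append(rec)
    return kept


key_dedup_rows = nodupkey_like(rows)

print(len(distinct_rows))   # rows that differ in any column
print(len(key_dedup_rows))  # one row per key
```

On these seven sample rows, the DISTINCT query keeps 6 rows while the key-based helper keeps 3. To get comparable counts in SQL, one option is to deduplicate on the same key columns only and normalize NULL to a blank first (for example with COALESCE) before grouping.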