More efficient way to query shortest string value associated with each value in another column in Hive QL

I have a table in Hive containing store names, order IDs, and user IDs (as well as some other columns, including item ID). There is a row in the table for every item purchased (so there can be more than one row per order if the order contains multiple items). Order IDs are unique within a store, but not across stores. A single order can have more than one user ID associated with it.

I'm trying to write a query that will return a list of all stores and order IDs and the shortest user ID associated with each order.

So, for example, if the data looks like this:

 STORE | ORDERID | USERID | ITEMID
 ------+---------+--------+-------
   a   |    1    |  bill  |  abc
   a   |    1    |  susan |  def
   a   |    2    |  jane  |  abc
   b   |    1    |  scott |  ghi
   b   |    1    |  tony  |  jkl

Then the output would look like this:

 STORE | ORDERID | USERID 
 ------+---------+-------
   a   |    1    |  bill 
   a   |    2    |  jane 
   b   |    1    |  tony 

I've written a query that will do this, but I feel like there must be a more efficient way to go about it. Does anybody know a better way to produce these results?

This is what I have so far:

select 
    users.store, users.orderid, users.userid
from 
    (select 
         store, orderid, userid, length(userid) as len 
     from 
         sales) users
join 
    (select distinct 
         store, orderid, 
         min(length(userid)) over (partition by store, orderid) as len 
     from 
         sales) len on users.store = len.store
                    and users.orderid = len.orderid
                    and users.len = len.len

rank() is probably the best way:

select s.*
from (select s.*,
             -- partition by store and orderid, since order IDs are only unique within a store
             rank() over (partition by store, orderid order by length(userid)) as seqnum
      from sales s
     ) s
where seqnum = 1;
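
One caveat with rank(): it gives every tied row the same rank, so an order returns several rows when two user IDs share the shortest length, or when the same user appears on several item rows. If exactly one row per order is wanted, a row_number() variant may be a better fit. The following is only a sketch against the sales table from the question; the alias names (ranked, rn) and the alphabetical tie-break are assumptions, not part of the original answer.

```sql
-- Sketch: exactly one row per (store, orderid).
-- Assumption: ties on length are broken alphabetically by userid.
select store, orderid, userid
from (
    select store, orderid, userid,
           row_number() over (partition by store, orderid
                              order by length(userid), userid) as rn
    from sales
) ranked
where rn = 1;
```

On the sample data this should yield bill for (a, 1), jane for (a, 2) and tony for (b, 1), matching the expected output in the question.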

This should work for you: it gets the result from a single SELECT, with no join or subquery.

select distinct 
    store, orderid, 
    first_value(userid) over(partition by store, orderid order by length(userid) asc) f_val 
from 
    sales;

The result will be:

store   orderid    f_val
a       1          bill
a       2          jane
b       1          tony
