简体   繁体   中英

Selecting first row using row_number()

I'm wondering about the performance of ROW_NUMBER(): In my test case, I have many elements with large number (~100k) entries in each element. I would like to select the first entry for each element:

SELECT * FROM(    
    SELECT 
      element, msg, timestamp, 
      ROW_NUMBER() OVER(PARTITION BY element ORDER BY timestamp) as rank
    FROM table
) t1
WHERE rank = 1

It seems excessive to me to need to order all 100k rows for each partition just to keep the first one. Something like a MIN() function should naively perform better. So I tried something like this:

SELECT
  element, MIN(a)[0] as timestamp, MIN(a)[1] as msg
FROM(
    SELECT element, ARRAY(timestamp, msg) as a
    FROM table
) t1
GROUP BY partition

But when comparing the CPU usage for both methods, I'm getting identical performance. Is the ROW_NUMBER() optimizer much smarter than I thought, or am I missing something?

EDIT: Had a bug in my original query. Looking at the results now, it appears like the MIN(ARRAY) method is considerably faster than the ROW_NUMBER() filter. Is there an easier canonical way of implementing a FIRST_ROW OVER( PARTITION) filter without having to define a massive array and then deconstruct it?

You are missing something. The GROUP BY takes a lot of work -- seemingly similar to the ROW_NUMBER() .

I am not intimately familiar with Hive's underlying mechanisms. In most databases, ROW_NUMBER() would take a physical identifier for the row, the various keys, and sort them separately. Using the physical identifier, the value can then be added back quickly to the original data. The detail, of course, might vary by database, but this is the general idea.

Hive probably does not do this. However, whatever it does would be similar to a GROUP BY in terms of effort.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM