
How to get the most frequent value in SQL

I have a table Orders(id_trip, id_order), a table Trip(id_hotel, id_bus, id_type_of_trip) and a table Hotel(id_hotel, name).

I would like to get the name of the most frequent hotel in table Orders.

SELECT hotel.name from Orders
 JOIN Trip
 on Orders.id_trip = Trip.id_hotel
 JOIN hotel
 on trip.id_hotel = hotel.id_hotel
  FROM (SELECT hotel.name, rank() over (order by cnt desc) rnk
          FROM (SELECT hotel.name, count(*) cnt
                  FROM Orders
                 GROUP BY hotel.name))
 WHERE rnk = 1;

The "most frequently occurring value" in a distribution is a distinct concept in statistics, with a technical name. It's called the MODE of the distribution. And Oracle has the STATS_MODE() function for it. https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions154.htm

For example, using the EMP table in the standard SCOTT schema, select stats_mode(deptno) from scott.emp will return 30, the number of the department with the most employees. (30 is the department "name" or number; it is NOT the number of employees in that department!)

In your case:

select stats_mode(h.name) from (the rest of your query)
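
Spelled out against the tables in the question, that could look like the sketch below. It assumes Trip also has an id_trip column to join Orders on (the question's column list does not show one); the aliases o, t and h and the output column name are only illustrative.

```sql
-- Sketch only: assumes trip.id_trip exists and links each order to its trip.
SELECT STATS_MODE(h.name) AS most_frequent_hotel
FROM   orders o
JOIN   trip   t ON o.id_trip  = t.id_trip   -- order -> trip
JOIN   hotel  h ON t.id_hotel = h.id_hotel; -- trip  -> hotel
```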

Note: if two or more hotels are tied for "most frequent", then STATS_MODE() will return one of them (non-deterministic). If you need all the tied values, you will need a different solution; a good example is in the documentation (linked above). This is a documented flaw in Oracle's understanding and implementation of the statistical concept.
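
If all tied hotels are needed, one possible shape is to count per hotel and keep every row that ranks first. This is a sketch, not the documentation's exact example; it assumes Trip has an id_trip column to join Orders on, and the aliases and the names cnt and rnk are illustrative.

```sql
-- Sketch only: returns every hotel whose order count equals the maximum.
SELECT name, cnt
FROM  (SELECT h.name,
              COUNT(*) AS cnt,
              -- RANK() gives all tied hotels the same rank 1
              RANK() OVER (ORDER BY COUNT(*) DESC) AS rnk
       FROM   orders o
       JOIN   trip   t ON o.id_trip  = t.id_trip
       JOIN   hotel  h ON t.id_hotel = h.id_hotel
       GROUP  BY h.name)
WHERE  rnk = 1;
```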

Use FIRST for a single result:

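-- The outer query can only see the inline view's columns, so refer to t.name, not hotel.name
SELECT MAX(t.name) KEEP (DENSE_RANK FIRST ORDER BY t.cnt DESC) AS name
FROM (
  SELECT hotel.name, COUNT(*) cnt
  FROM orders
  JOIN trip USING (id_trip)
  JOIN hotel USING (id_hotel)
  GROUP BY hotel.name
) t;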

Here is one method:

select name
from (select h.name,
             row_number() over (order by count(*) desc) as seqnum  -- use `rank()` if you want duplicates
      from orders o join
           trip t 
           on o.id_trip = t.id_trip join -- this seems like the right join condition
           hotel h
           on t.id_hotel = h.id_hotel
      group by h.name -- needed so that count(*) is computed per hotel
     ) oth
where seqnum = 1;

** Getting the most recent statistical mode out of a data sample **

I know it's been more than a year, but here's my answer. I came across this question hoping to find a simpler solution than the one I know, but alas, nope.

I had a similar situation where I needed to get the mode from a data sample, with the requirement that, if there were multiple modes, I get the one containing the most recently inserted value.

In such a case neither the STATS_MODE nor the LAST aggregate function would do (as they would tend to return the first mode found, not necessarily the mode with the most recent entries).

In my case it was easy to use the ROWNUM pseudo-column because the tables in question were performance metric tables that only experienced inserts (not updates).

In this oversimplified example, I'm using ROWNUM; it could easily be changed to a timestamp or sequence field if you have one.

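 SELECT VALUE
   FROM
        -- One row per distinct value: its count and the largest ROWNUM it was seen at
        (SELECT VALUE,
                COUNT(*) CNT,
                MAX(R) R
           FROM
                -- Select VALUE (not ID) so the outer GROUP BY matches the column being counted
                (SELECT VALUE, ROWNUM R FROM FOO)
          GROUP BY VALUE
          ORDER BY CNT DESC,
                   R DESC
        )
  WHERE ROWNUM < 2;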

That is, get the total count and max ROWNUM for each value (I'm assuming the values are discrete. If they aren't, this ain't gonna work.)

Then sort so that the ones with the largest counts come first, and for those with the same count, the one with the largest ROWNUM comes first (indicating most recent insertion in my case).

Then skim off the top row.
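
As a sketch of the timestamp variant mentioned earlier: if the table has an insertion timestamp, the same three steps can be written without ROWNUM. The column name created_at is hypothetical, and FETCH FIRST needs Oracle 12c or later; on older versions the ROWNUM wrapper above would still be required.

```sql
-- Sketch only: created_at is an assumed insertion-timestamp column on FOO.
SELECT value
FROM   foo
GROUP  BY value
ORDER  BY COUNT(*) DESC,         -- most frequent value first
          MAX(created_at) DESC   -- among ties, the most recently inserted
FETCH  FIRST 1 ROW ONLY;         -- skim off the top row
```

An explicit timestamp or sequence column is arguably the safer choice, since ROWNUM taken from an unordered scan is not guaranteed to reflect insertion order.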

Your specific data model should have a way to discern the most recent (or the oldest or whatever) rows inserted in your table, and if there are collisions, then there's not much of a way other than using ROWNUM or getting a random sample of size 1.

If this doesn't work for your specific case, you'll have to create your own custom aggregator.

Now, if you don't care which mode Oracle is going to pick (your business case just requires a mode and that's it), then STATS_MODE will do fine.
