Django Count and Sum annotations interfere with each other

While constructing a complex QuerySet with several annotations, I ran into an issue that I could reproduce with the following simple setup.

Here are the models:

class Player(models.Model):
    name = models.CharField(max_length=200)

class Unit(models.Model):
    player = models.ForeignKey(Player, on_delete=models.CASCADE,
                               related_name='unit_set')
    rarity = models.IntegerField()

class Weapon(models.Model):
    unit = models.ForeignKey(Unit, on_delete=models.CASCADE,
                             related_name='weapon_set')

With my test database, I obtain the following (correct) results:

Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))

[{'id': 1, 'name': 'James', 'weapon_count': 23},
 {'id': 2, 'name': 'Max', 'weapon_count': 41},
 {'id': 3, 'name': 'Bob', 'weapon_count': 26}]


Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))

[{'id': 1, 'name': 'James', 'rarity_sum': 42},
 {'id': 2, 'name': 'Max', 'rarity_sum': 89},
 {'id': 3, 'name': 'Bob', 'rarity_sum': 67}]

If I now combine both annotations in the same QuerySet, I obtain different (inaccurate) results:

Player.objects.annotate(
    weapon_count=Count('unit_set__weapon_set', distinct=True),
    rarity_sum=Sum('unit_set__rarity'))

[{'id': 1, 'name': 'James', 'weapon_count': 23, 'rarity_sum': 99},
 {'id': 2, 'name': 'Max', 'weapon_count': 41, 'rarity_sum': 183},
 {'id': 3, 'name': 'Bob', 'weapon_count': 26, 'rarity_sum': 113}]

Notice how rarity_sum now has different values than before. Removing distinct=True does not affect the result. I also tried to use the DistinctSum function from this answer, in which case all rarity_sum values are set to 18 (also inaccurate).

Why is this? How can I combine both annotations in the same QuerySet?

Edit: here is the SQLite query generated by the combined QuerySet:

SELECT "sandbox_player"."id",
       "sandbox_player"."name",
       COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
       SUM("sandbox_unit"."rarity")          AS "rarity_sum"
FROM "sandbox_player"
         LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
         LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

The data used for the results above is available here.

This isn't a problem with the Django ORM; it's just the way relational databases work. When you're constructing simple querysets like

Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))

or

Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))

the ORM does exactly what you expect it to do: it joins Player with Weapon

SELECT "sandbox_player"."id", "sandbox_player"."name", COUNT("sandbox_weapon"."id") AS "weapon_count"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" 
    ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
LEFT OUTER JOIN "sandbox_weapon" 
    ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

or Player with Unit

SELECT "sandbox_player"."id", "sandbox_player"."name", SUM("sandbox_unit"."rarity") AS "rarity_sum"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

and performs either a COUNT or a SUM aggregation on them.

Note that although the first query has two joins between three tables, the intermediate table Unit appears neither in the columns referenced in SELECT nor in the GROUP BY clause. The only role that Unit plays here is to join Player with Weapon.

Now if you look at your third queryset, things get more complicated. Again, as in the first query, the joins are between three tables, but now Unit is referenced in SELECT because there is a SUM aggregation on Unit.rarity:

SELECT "sandbox_player"."id",
       "sandbox_player"."name",
       COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
       SUM("sandbox_unit"."rarity")          AS "rarity_sum"
FROM "sandbox_player"
         LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
         LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"

And this is the crucial difference between the second and the third queries. In the second query, you're joining Player to Unit, so each Unit is listed exactly once, for the player that it references.

But in the third query you're joining Player to Unit and then Unit to Weapon, so a single Unit is no longer listed just once for the player that it references; it is listed once for each Weapon that references that Unit.

Let's take a look at a simple example:

insert into sandbox_player values (1, 'player_1');

insert into sandbox_unit values(1, 10, 1);

insert into sandbox_weapon values (1, 1), (2, 1);

One player, one unit and two weapons that reference the same unit.

Confirm that the problem exists:

>>> from sandbox.models import Player
>>> from django.db.models import Count, Sum

>>> Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2}]>

>>> Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'rarity_sum': 10}]>


>>> Player.objects.annotate(
...     weapon_count=Count('unit_set__weapon_set', distinct=True),
...     rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 20}]>

From this example it's easy to see the problem: in the combined query the unit is listed twice, once for each of the weapons referencing it:

sqlite> SELECT "sandbox_player"."id",
   ...>        "sandbox_player"."name",
   ...>        "sandbox_weapon"."id",
   ...>        "sandbox_unit"."rarity"
   ...> FROM "sandbox_player"
   ...>          LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
   ...>          LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id");
id          name        id          rarity    
----------  ----------  ----------  ----------
1           player_1    1           10        
1           player_1    2           10   
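The same fan-out can be reproduced outside Django. Here is a minimal sketch using Python's built-in sqlite3 module. The table names (player, unit, weapon) and their column order are simplified stand-ins chosen for illustration, not the schema Django generates for the sandbox app.

```python
import sqlite3

# Hypothetical stand-in tables (not Django's generated schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE player (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE unit (id INTEGER PRIMARY KEY, rarity INTEGER, player_id INTEGER);
    CREATE TABLE weapon (id INTEGER PRIMARY KEY, unit_id INTEGER);
    INSERT INTO player (id, name) VALUES (1, 'player_1');
    INSERT INTO unit (id, rarity, player_id) VALUES (1, 10, 1);
    INSERT INTO weapon (id, unit_id) VALUES (1, 1), (2, 1);
""")

JOINS = """
    FROM player
    LEFT OUTER JOIN unit ON player.id = unit.player_id
    LEFT OUTER JOIN weapon ON unit.id = weapon.unit_id
"""

# Rows before aggregation: the single unit appears once per weapon.
joined_rows = conn.execute(
    "SELECT player.id, weapon.id, unit.rarity" + JOINS + "ORDER BY weapon.id"
).fetchall()

# Aggregating over those rows adds the unit's rarity once per weapon.
combined = conn.execute(
    "SELECT COUNT(DISTINCT weapon.id), SUM(unit.rarity)" + JOINS + "GROUP BY player.id"
).fetchone()

# Aggregating over player and unit alone gives the expected sum.
unit_only = conn.execute(
    "SELECT SUM(unit.rarity) FROM player "
    "LEFT OUTER JOIN unit ON player.id = unit.player_id GROUP BY player.id"
).fetchone()

print(joined_rows)  # [(1, 1, 10), (1, 2, 10)]
print(combined)     # (2, 20)
print(unit_only)    # (10,)
```

COUNT(DISTINCT ...) hides the duplication for the weapon count, but SUM has no equivalent safeguard, because two different units can legitimately share the same rarity.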

What should you do?

As @ivissani mentioned, one of the easiest solutions would be to write subqueries for each of the aggregations:

>>> from django.db.models import Count, IntegerField, OuterRef, Subquery, Sum
>>> weapon_count = Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).filter(pk=OuterRef('pk'))
>>> rarity_sum = Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).filter(pk=OuterRef('pk'))
>>> qs = Player.objects.annotate(
...     weapon_count=Subquery(weapon_count.values('weapon_count'), output_field=IntegerField()),
...     rarity_sum=Subquery(rarity_sum.values('rarity_sum'), output_field=IntegerField())
... )
>>> qs.values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 10}]>

which produces the following SQL:

SELECT "sandbox_player"."id", "sandbox_player"."name", 
(
    SELECT COUNT(U2."id") AS "weapon_count"
    FROM "sandbox_player" U0 
    LEFT OUTER JOIN "sandbox_unit" U1
        ON (U0."id" = U1."player_id")
    LEFT OUTER JOIN "sandbox_weapon" U2 
        ON (U1."id" = U2."unit_id")
    WHERE U0."id" = ("sandbox_player"."id") 
    GROUP BY U0."id", U0."name"
) AS "weapon_count", 
(
    SELECT SUM(U1."rarity") AS "rarity_sum"
    FROM "sandbox_player" U0
    LEFT OUTER JOIN "sandbox_unit" U1
        ON (U0."id" = U1."player_id")
    WHERE U0."id" = ("sandbox_player"."id")
    GROUP BY U0."id", U0."name"
) AS "rarity_sum"
FROM "sandbox_player"
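To see why one subquery per aggregate avoids the problem, the sketch below compares the single-join query with a hand-simplified version of the correlated-subquery query, again using Python's sqlite3 module. The table names, the sample rows and the simplified SQL are assumptions made for illustration; this is not the exact SQL that Django emits.

```python
import sqlite3

# Hypothetical stand-in tables and sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE player (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE unit (id INTEGER PRIMARY KEY, rarity INTEGER, player_id INTEGER);
    CREATE TABLE weapon (id INTEGER PRIMARY KEY, unit_id INTEGER);
    INSERT INTO player (id, name) VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO unit (id, rarity, player_id) VALUES (1, 10, 1), (2, 5, 1), (3, 7, 2);
    INSERT INTO weapon (id, unit_id) VALUES (1, 1), (2, 1), (3, 3), (4, 3), (5, 3);
""")

# One query, two joins: each unit's rarity is repeated once per weapon.
single_join = conn.execute("""
    SELECT player.id, COUNT(DISTINCT weapon.id), SUM(unit.rarity)
    FROM player
    LEFT OUTER JOIN unit ON player.id = unit.player_id
    LEFT OUTER JOIN weapon ON unit.id = weapon.unit_id
    GROUP BY player.id
    ORDER BY player.id
""").fetchall()

# One correlated subquery per aggregate: each aggregate only sees the rows
# it needs, so the weapon join cannot inflate the rarity sum.
per_aggregate = conn.execute("""
    SELECT p.id,
           (SELECT COUNT(w.id)
            FROM unit u JOIN weapon w ON w.unit_id = u.id
            WHERE u.player_id = p.id) AS weapon_count,
           (SELECT SUM(u.rarity)
            FROM unit u
            WHERE u.player_id = p.id) AS rarity_sum
    FROM player p
    ORDER BY p.id
""").fetchall()

print(single_join)    # [(1, 2, 25), (2, 3, 21), (3, 0, None)]
print(per_aggregate)  # [(1, 2, 15), (2, 3, 7), (3, 0, None)]
```

Player 1 owns units with rarity 10 and 5, so the correct sum is 15; the single-join query reports 25 because the rarity-10 unit has two weapons.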

A few notes to complement rktavi's excellent answer:

1) This issue has apparently been considered a bug for 10 years already. It is even referred to in the official documentation.

2) While converting my actual project's QuerySets to subqueries (as per rktavi's answer), I noticed that combining bare-bones annotations (for the distinct=True counts that always worked correctly) with a Subquery (for the sums) yields extremely long processing times (35 s vs. 100 ms) and incorrect results for the sum. This is true in my actual setup (11 filtered counts on various nested relations and 1 filtered sum on a multiply-nested relation, SQLite3) but cannot be reproduced with the simple models above. This issue can be tricky because another part of your code could add an annotation to your QuerySet (e.g. a Table.order_FOO() function), leading to the issue.

3) With the same setup, I have anecdotal evidence that subquery-type QuerySets are faster than bare-bones annotation QuerySets (in cases where you have only distinct=True counts, of course). I could observe this both with local SQLite3 (83 ms vs. 260 ms) and hosted PostgreSQL (320 ms vs. 540 ms).

As a result of the above, I will completely avoid bare-bones annotations in favour of subqueries.

Based on the excellent answer from @rktavi, I created two helper classes that simplify the Subquery/Count and Subquery/Sum patterns:

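from django.db.models import PositiveIntegerField, Subquery


class SubqueryCount(Subquery):
    # Wraps the subquery and counts the rows it returns.
    template = "(SELECT count(*) FROM (%(subquery)s) _count)"
    output_field = PositiveIntegerField()


class SubquerySum(Subquery):
    # Wraps the subquery and sums the given column over the rows it returns.
    template = '(SELECT sum(_sum."%(column)s") FROM (%(subquery)s) _sum)'

    def __init__(self, queryset, column, output_field=None, **extra):
        if output_field is None:
            # Default to the type of the model field being summed.
            output_field = queryset.model._meta.get_field(column)
        super().__init__(queryset, output_field, column=column, **extra)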

One can use these helpers like so:

from django.db.models import OuterRef

weapons = Weapon.objects.filter(unit__player_id=OuterRef('id'))
units = Unit.objects.filter(player_id=OuterRef('id'))

qs = Player.objects.annotate(weapon_count=SubqueryCount(weapons),
                             rarity_sum=SubquerySum(units, 'rarity'))
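To get a feel for the SQL shape these two templates lead to, the sketch below fills them in by hand with Python string formatting and runs the result through sqlite3. The inner SELECT strings, the table names and the sample data are assumptions standing in for what the Weapon and Unit querysets would compile to; Django's real output uses its own aliases and quoting.

```python
import sqlite3

# The two templates from the helper classes above.
COUNT_TEMPLATE = "(SELECT count(*) FROM (%(subquery)s) _count)"
SUM_TEMPLATE = '(SELECT sum(_sum."%(column)s") FROM (%(subquery)s) _sum)'

# Hand-written stand-ins for the compiled querysets; player.id plays the
# role of OuterRef('id').
weapons_sql = (
    "SELECT weapon.id FROM weapon "
    "INNER JOIN unit ON weapon.unit_id = unit.id "
    "WHERE unit.player_id = player.id"
)
units_sql = "SELECT unit.id, unit.rarity FROM unit WHERE unit.player_id = player.id"

count_expr = COUNT_TEMPLATE % {"subquery": weapons_sql}
sum_expr = SUM_TEMPLATE % {"subquery": units_sql, "column": "rarity"}
query = f"SELECT player.id, {count_expr}, {sum_expr} FROM player ORDER BY player.id"

# Hypothetical stand-in tables and sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE player (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE unit (id INTEGER PRIMARY KEY, rarity INTEGER, player_id INTEGER);
    CREATE TABLE weapon (id INTEGER PRIMARY KEY, unit_id INTEGER);
    INSERT INTO player (id, name) VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO unit (id, rarity, player_id) VALUES (1, 10, 1), (2, 5, 1), (3, 7, 2);
    INSERT INTO weapon (id, unit_id) VALUES (1, 1), (2, 1), (3, 3), (4, 3), (5, 3);
""")

rows = conn.execute(query).fetchall()
print(rows)
```

Because each aggregate wraps its own filtered subquery, the weapon rows never multiply the unit rows. A player without units gets a count of 0 but a sum of NULL (None in Python), which may need a Coalesce in a real queryset.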

Thanks @rktavi for your amazing answer!

Here's my use case:

Using Django REST framework (DRF).

I needed to get a Sum and a Count from different FKs inside the same annotate call, so that both would be part of one queryset and I could add these fields to the ordering_fields in DRF.

The Sum and Count were clashing and returning wrong results. Your answer really helped me put it all together.

The annotation was occasionally returning the dates as strings, so I needed to Cast them to DateTimeField.

    from django.db.models import Count, DateTimeField, IntegerField, Max, OuterRef, Q, Subquery, Sum
    from django.db.models.functions import Cast

    donation_filter = Q(payments__status='donated') & ~Q(payments__payment_type__payment_type='coupon')
    total_donated_SQ = User.objects.annotate(total_donated=Sum('payments__sum', filter=donation_filter)).filter(pk=OuterRef('pk'))
    message_count_SQ = User.objects.annotate(message_count=Count('events__id', filter=Q(events__event_id=6))).filter(pk=OuterRef('pk'))
    queryset = User.objects.annotate(
        total_donated=Subquery(total_donated_SQ.values('total_donated'), output_field=IntegerField()),
        last_donation_date=Cast(Max('payments__updated', filter=donation_filter), output_field=DateTimeField()),
        message_count=Subquery(message_count_SQ.values('message_count'), output_field=IntegerField()),
        last_message_date=Cast(Max('events__updated', filter=Q(events__event_id=6)), output_field=DateTimeField())
    )

