Django Count and Sum annotations interfere with each other
While constructing a complex QuerySet with several annotations, I ran into an issue that I could reproduce with the following simple setup.
Here are the models:
from django.db import models

class Player(models.Model):
    name = models.CharField(max_length=200)

class Unit(models.Model):
    player = models.ForeignKey(Player, on_delete=models.CASCADE,
                               related_name='unit_set')
    rarity = models.IntegerField()

class Weapon(models.Model):
    unit = models.ForeignKey(Unit, on_delete=models.CASCADE,
                             related_name='weapon_set')
With my test database, I obtain the following (correct) results:
Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))
[{'id': 1, 'name': 'James', 'weapon_count': 23},
{'id': 2, 'name': 'Max', 'weapon_count': 41},
{'id': 3, 'name': 'Bob', 'weapon_count': 26}]
Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))
[{'id': 1, 'name': 'James', 'rarity_sum': 42},
{'id': 2, 'name': 'Max', 'rarity_sum': 89},
{'id': 3, 'name': 'Bob', 'rarity_sum': 67}]
If I now combine both annotations in the same QuerySet, I obtain different (inaccurate) results:
Player.objects.annotate(
    weapon_count=Count('unit_set__weapon_set', distinct=True),
    rarity_sum=Sum('unit_set__rarity'))
[{'id': 1, 'name': 'James', 'weapon_count': 23, 'rarity_sum': 99},
{'id': 2, 'name': 'Max', 'weapon_count': 41, 'rarity_sum': 183},
{'id': 3, 'name': 'Bob', 'weapon_count': 26, 'rarity_sum': 113}]
Notice how rarity_sum now has different values than before. Removing distinct=True does not affect the result. I also tried to use the DistinctSum function from this answer, in which case all rarity_sum values are set to 18 (also inaccurate).
Why is this? How can I combine both annotations in the same QuerySet?
Edit: here is the SQLite query generated by the combined QuerySet:
SELECT "sandbox_player"."id",
"sandbox_player"."name",
COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
SUM("sandbox_unit"."rarity") AS "rarity_sum"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"
The data used for the results above is available here.
This isn't a problem with the Django ORM; this is just the way relational databases work. When you're constructing simple querysets like
Player.objects.annotate(weapon_count=Count('unit_set__weapon_set'))
or
Player.objects.annotate(rarity_sum=Sum('unit_set__rarity'))
the ORM does exactly what you expect it to do: join Player with Weapon
SELECT "sandbox_player"."id", "sandbox_player"."name", COUNT("sandbox_weapon"."id") AS "weapon_count"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit"
ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
LEFT OUTER JOIN "sandbox_weapon"
ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"
or Player with Unit
SELECT "sandbox_player"."id", "sandbox_player"."name", SUM("sandbox_unit"."rarity") AS "rarity_sum"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"
and perform either COUNT or SUM aggregation on them.
Note that although the first query has two joins between three tables, the intermediate table Unit appears neither in the columns referenced in SELECT nor in the GROUP BY clause. The only role that Unit plays here is to join Player with Weapon.
Now if you look at your third queryset, things get more complicated. Again, as in the first query, the joins are between three tables, but now Unit is referenced in SELECT because there is a SUM aggregation over Unit.rarity:
SELECT "sandbox_player"."id",
"sandbox_player"."name",
COUNT(DISTINCT "sandbox_weapon"."id") AS "weapon_count",
SUM("sandbox_unit"."rarity") AS "rarity_sum"
FROM "sandbox_player"
LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id")
GROUP BY "sandbox_player"."id", "sandbox_player"."name"
And this is the crucial difference between the second and the third queries. In the second query, you're joining Player to Unit, so a single Unit will be listed once for each player that it references.
But in the third query you're joining Player to Unit and then Unit to Weapon, so a single Unit will be listed not only once for each player that it references, but also once for each weapon that references that Unit.
Let's take a look at a simple example:
insert into sandbox_player values (1, 'player_1');
insert into sandbox_unit values(1, 10, 1);
insert into sandbox_weapon values (1, 1), (2, 1);
One player, one unit and two weapons that reference the same unit.
Confirm that the problem exists:
>>> from sandbox.models import Player
>>> from django.db.models import Count, Sum
>>> Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2}]>
>>> Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'rarity_sum': 10}]>
>>> Player.objects.annotate(
... weapon_count=Count('unit_set__weapon_set', distinct=True),
... rarity_sum=Sum('unit_set__rarity')).values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 20}]>
From this example it's easy to see the problem: in the combined query the unit is listed twice, once for each of the weapons referencing it:
sqlite> SELECT "sandbox_player"."id",
...> "sandbox_player"."name",
...> "sandbox_weapon"."id",
...> "sandbox_unit"."rarity"
...> FROM "sandbox_player"
...> LEFT OUTER JOIN "sandbox_unit" ON ("sandbox_player"."id" = "sandbox_unit"."player_id")
...> LEFT OUTER JOIN "sandbox_weapon" ON ("sandbox_unit"."id" = "sandbox_weapon"."unit_id");
id name id rarity
---------- ---------- ---------- ----------
1 player_1 1 10
1 player_1 2 10
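The duplicated rows also explain why a DISTINCT sum, as tried in the question, does not repair the total: SUM(DISTINCT ...) removes duplicate values, not duplicate units, so two different units that happen to share a rarity collapse into one. The following is a minimal sketch using Python's standard sqlite3 module instead of the ORM; the three units and their rarities are made-up sample data, and the short aliases p, u and w are my own.

```python
import sqlite3

# In-memory database mirroring the sandbox tables (made-up sample data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sandbox_player (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sandbox_unit (id INTEGER PRIMARY KEY, rarity INTEGER, player_id INTEGER);
    CREATE TABLE sandbox_weapon (id INTEGER PRIMARY KEY, unit_id INTEGER);
    INSERT INTO sandbox_player VALUES (1, 'player_1');
    -- Units 1 and 2 share the same rarity on purpose.
    INSERT INTO sandbox_unit VALUES (1, 10, 1), (2, 10, 1), (3, 5, 1);
    -- Unit 1 has two weapons, unit 2 has one, unit 3 has none.
    INSERT INTO sandbox_weapon VALUES (1, 1), (2, 1), (3, 2);
""")

JOINS = """
    FROM sandbox_player p
    LEFT OUTER JOIN sandbox_unit u ON p.id = u.player_id
    LEFT OUTER JOIN sandbox_weapon w ON u.id = w.unit_id
    GROUP BY p.id, p.name
"""

# Both aggregates run over the joined rows, where unit 1 appears twice.
plain_sum, distinct_sum = conn.execute(
    "SELECT SUM(u.rarity), SUM(DISTINCT u.rarity)" + JOINS
).fetchone()

# Reference value computed without the weapon join.
true_sum = conn.execute(
    "SELECT SUM(rarity) FROM sandbox_unit WHERE player_id = 1"
).fetchone()[0]

print(true_sum, plain_sum, distinct_sum)
```

With this data the plain SUM overcounts (unit 1 is added once per weapon), while SUM(DISTINCT) undercounts (units 1 and 2 merge because both have rarity 10). Neither matches the true total.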
As @ivissani mentioned, one of the easiest solutions would be to write subqueries for each of the aggregations:
>>> from django.db.models import Count, IntegerField, OuterRef, Subquery, Sum
>>> weapon_count = Player.objects.annotate(weapon_count=Count('unit_set__weapon_set')).filter(pk=OuterRef('pk'))
>>> rarity_sum = Player.objects.annotate(rarity_sum=Sum('unit_set__rarity')).filter(pk=OuterRef('pk'))
>>> qs = Player.objects.annotate(
... weapon_count=Subquery(weapon_count.values('weapon_count'), output_field=IntegerField()),
... rarity_sum=Subquery(rarity_sum.values('rarity_sum'), output_field=IntegerField())
... )
>>> qs.values()
<QuerySet [{'id': 1, 'name': 'player_1', 'weapon_count': 2, 'rarity_sum': 10}]>
which produces the following SQL:
SELECT "sandbox_player"."id", "sandbox_player"."name",
(
SELECT COUNT(U2."id") AS "weapon_count"
FROM "sandbox_player" U0
LEFT OUTER JOIN "sandbox_unit" U1
ON (U0."id" = U1."player_id")
LEFT OUTER JOIN "sandbox_weapon" U2
ON (U1."id" = U2."unit_id")
WHERE U0."id" = ("sandbox_player"."id")
GROUP BY U0."id", U0."name"
) AS "weapon_count",
(
SELECT SUM(U1."rarity") AS "rarity_sum"
FROM "sandbox_player" U0
LEFT OUTER JOIN "sandbox_unit" U1
ON (U0."id" = U1."player_id")
WHERE U0."id" = ("sandbox_player"."id")
GROUP BY U0."id", U0."name") AS "rarity_sum"
FROM "sandbox_player"
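To see the two query shapes side by side outside of Django, here is a small sketch with Python's standard sqlite3 module. The SQL is a hand-simplified version of the statements above (shorter aliases, no redundant join back to the player table inside the subqueries), and the two players with their units and weapons are made-up sample data, so treat it as an illustration rather than the exact SQL Django emits.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sandbox_player (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sandbox_unit (id INTEGER PRIMARY KEY, rarity INTEGER, player_id INTEGER);
    CREATE TABLE sandbox_weapon (id INTEGER PRIMARY KEY, unit_id INTEGER);
    INSERT INTO sandbox_player VALUES (1, 'player_1'), (2, 'player_2');
    INSERT INTO sandbox_unit VALUES (1, 10, 1), (2, 10, 1), (3, 5, 1), (4, 7, 2);
    INSERT INTO sandbox_weapon VALUES (1, 1), (2, 1), (3, 2);
""")

# Both aggregates over one joined row set: the sum sees duplicated units.
combined = conn.execute("""
    SELECT p.id, p.name,
           COUNT(DISTINCT w.id) AS weapon_count,
           SUM(u.rarity) AS rarity_sum
    FROM sandbox_player p
    LEFT OUTER JOIN sandbox_unit u ON p.id = u.player_id
    LEFT OUTER JOIN sandbox_weapon w ON u.id = w.unit_id
    GROUP BY p.id, p.name
    ORDER BY p.id
""").fetchall()

# One correlated subquery per aggregate: each works on its own row set.
subqueries = conn.execute("""
    SELECT p.id, p.name,
           (SELECT COUNT(w.id)
            FROM sandbox_unit u
            INNER JOIN sandbox_weapon w ON w.unit_id = u.id
            WHERE u.player_id = p.id) AS weapon_count,
           (SELECT SUM(u.rarity)
            FROM sandbox_unit u
            WHERE u.player_id = p.id) AS rarity_sum
    FROM sandbox_player p
    ORDER BY p.id
""").fetchall()

print(combined)
print(subqueries)
```

The weapon counts agree in both forms. Only the sum differs, and only for the player whose units carry more than one weapon.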
A few notes to complement rktavi's excellent answer:
1) This issue has apparently been considered a bug for 10 years already. It is even referred to in the official documentation.
2) While converting my actual project's QuerySets to subqueries (as per rktavi's answer), I noticed that combining bare-bones annotations (for the distinct=True counts that always worked correctly) with a Subquery (for the sums) yields extremely long processing times (35 s vs. 100 ms) and incorrect results for the sum. This is true in my actual setup (11 filtered counts on various nested relations and 1 filtered sum on a multiply-nested relation, SQLite3) but cannot be reproduced with the simple models above. This issue can be tricky because another part of your code could add an annotation to your QuerySet (e.g. a Table.order_FOO() function), leading to the issue.
3) With the same setup, I have anecdotal evidence that subquery-type QuerySets are faster than bare-bones annotation QuerySets (in cases where you have only distinct=True counts, of course). I could observe this both with local SQLite3 (83 ms vs. 260 ms) and hosted PostgreSQL (320 ms vs. 540 ms).
As a result of the above, I will completely avoid using bare-bones annotations in favour of subqueries.
Based on the excellent answer from @rktavi, I created two helper classes that simplify the Subquery/Count and Subquery/Sum patterns:
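from django.db.models import PositiveIntegerField, Subquery

class SubqueryCount(Subquery):
    # Count the rows returned by the wrapped subquery.
    template = "(SELECT count(*) FROM (%(subquery)s) _count)"
    output_field = PositiveIntegerField()

class SubquerySum(Subquery):
    # Sum one column of the rows returned by the wrapped subquery.
    template = '(SELECT sum(_sum."%(column)s") FROM (%(subquery)s) _sum)'

    def __init__(self, queryset, column, output_field=None, **extra):
        if output_field is None:
            # Default to the type of the summed model field.
            output_field = queryset.model._meta.get_field(column)
        super().__init__(queryset, output_field, column=column, **extra)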
One can use these helpers like so:
from django.db.models import OuterRef
weapons = Weapon.objects.filter(unit__player_id=OuterRef('id'))
units = Unit.objects.filter(player_id=OuterRef('id'))
qs = Player.objects.annotate(weapon_count=SubqueryCount(weapons),
                             rarity_sum=SubquerySum(units, 'rarity'))
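To get a feel for what the two template strings expand to, the sketch below fills them in by hand with Python's % formatting and runs the result through the standard sqlite3 module. The inner queries are hand-written stand-ins for what the ORM would generate, with player id 1 hard-coded where the ORM version uses OuterRef('id'); the sample rows are made up as well. The real SQL Django produces will differ in aliases and selected columns.

```python
import sqlite3

# The two template strings, copied from the helper classes.
COUNT_TEMPLATE = "(SELECT count(*) FROM (%(subquery)s) _count)"
SUM_TEMPLATE = '(SELECT sum(_sum."%(column)s") FROM (%(subquery)s) _sum)'

# Hypothetical inner queries standing in for the ORM-generated subqueries.
units_sql = 'SELECT "id", "rarity" FROM "sandbox_unit" WHERE "player_id" = 1'
weapons_sql = (
    'SELECT w."id" FROM "sandbox_weapon" w '
    'INNER JOIN "sandbox_unit" u ON w."unit_id" = u."id" '
    'WHERE u."player_id" = 1'
)

# Fill in the placeholders the same way a %-style template is expanded.
count_sql = COUNT_TEMPLATE % {"subquery": weapons_sql}
sum_sql = SUM_TEMPLATE % {"subquery": units_sql, "column": "rarity"}
print(count_sql)
print(sum_sql)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sandbox_unit (id INTEGER PRIMARY KEY, rarity INTEGER, player_id INTEGER);
    CREATE TABLE sandbox_weapon (id INTEGER PRIMARY KEY, unit_id INTEGER);
    INSERT INTO sandbox_unit VALUES (1, 10, 1), (2, 10, 1), (3, 5, 1);
    INSERT INTO sandbox_weapon VALUES (1, 1), (2, 1), (3, 2);
""")

# Each expanded template is a parenthesised scalar subquery, so both can be
# selected directly as columns.
weapon_count, rarity_sum = conn.execute(
    f"SELECT {count_sql}, {sum_sql}"
).fetchone()
print(weapon_count, rarity_sum)
```

Because each aggregate wraps its own inner query, the weapon join never touches the rows that feed the sum.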
Thanks @rktavi for your amazing answer!
Here's my use case:
Using Django REST framework (DRF).
I needed to get Sum and Count from different FKs inside the annotate so that it would all be part of one queryset, in order to add these fields to the ordering_fields in DRF.
The Sum and Count were clashing and returning wrong results. Your answer really helped me put it all together.
The annotate was occasionally returning the dates as strings, so I needed to Cast them to DateTimeField.
from django.db.models import Count, DateTimeField, IntegerField, Max, OuterRef, Q, Subquery, Sum
from django.db.models.functions import Cast

donation_filter = Q(payments__status='donated') & ~Q(payments__payment_type__payment_type='coupon')
total_donated_SQ = User.objects.annotate(total_donated=Sum('payments__sum', filter=donation_filter)).filter(pk=OuterRef('pk'))
message_count_SQ = User.objects.annotate(message_count=Count('events__id', filter=Q(events__event_id=6))).filter(pk=OuterRef('pk'))
queryset = User.objects.annotate(
    total_donated=Subquery(total_donated_SQ.values('total_donated'), output_field=IntegerField()),
    last_donation_date=Cast(Max('payments__updated', filter=donation_filter), output_field=DateTimeField()),
    message_count=Subquery(message_count_SQ.values('message_count'), output_field=IntegerField()),
    last_message_date=Cast(Max('events__updated', filter=Q(events__event_id=6)), output_field=DateTimeField())
)