
Django query with annotation and conditional count too slow

I have a query with annotations, counts and conditional expressions that runs very slowly; it takes forever.

I have two models: one that stores Instagram publications and another that stores Twitter publications. Each publication also has an FK to a model that represents a hexagonal geographical area within a city.

Publications [FK] -> HexCityArea

TwitterPublication [FK] -> HexCityArea

I'm trying to count the publications for each hexagon, but the publications are pre-filtered by other fields like date, so the code is:

instagram_publications_ids = list(instagram_publications.values_list('id', flat=True))
twitter_publications_ids = list(twitter_publications.values_list('id', flat=True))

print "\n[HEXAGONS QUERY]> List of publications ids insta\n %s \n" % instagram_publications.query
print instagram_publications.explain()
print "\n[HEXAGONS QUERY]> List of publications ids twitter\n %s \n" % twitter_publications.query
print twitter_publications.explain()

# Get count of publications by hexagon
resultant_hexagons = HexagonalCityArea.objects.filter(city=city).annotate(
    instagram_count=Count(Case(
        When(publication__id__in=instagram_publications_ids, then=1),
        output_field=IntegerField(),
    ))
).annotate(
    twitter_count=Count(Case(
        When(twitterpublication__id__in=twitter_publications_ids, then=1),
        output_field=IntegerField(),
    ))
)  # .filter(instagram_count__gt=0).filter(twitter_count__gt=0)  # Discard empty hexagons

# For debug only
print "\n[HEXAGONS QUERY]> Count of publications\n %s \n" % resultant_hexagons.query
print resultant_hexagons.explain()

resultant_hexagons_list = list(resultant_hexagons)
# Iterate remaining hexagons
city_hexagons = [h for h in resultant_hexagons_list if h.instagram_count > 0 or h.twitter_count > 0]

As you can see, I first get the lists of IDs of the selected publications and later use them to count only those publications.

One problem I see is that the lists of IDs are very long (around 28000 elements), but if I don't use them I don't get the desired results: the count condition doesn't work properly and all the publications in the city are counted.

I've tried this to avoid using the lists of IDs:

resultant_hexagons = HexagonalCityArea.objects.filter(city=city).annotate(
    instagram_count=Count(Case(
        When(publication__in=instagram_publications, then=1),
        output_field=IntegerField(),
    ))
).annotate(
    twitter_count=Count(Case(
        When(twitterpublication__in=twitter_publications, then=1),
        output_field=IntegerField(),
    ))
).filter(instagram_count__gt=0).filter(twitter_count__gt=0)  # Discard empty hexagons

# For debug only
print "\n[HEXAGONS QUERY]> Count of publications\n %s \n" % resultant_hexagons.query
print resultant_hexagons.explain()

Here is the generated SQL:

SELECT
   "instanalysis_hexagonalcityarea"."id",
   "instanalysis_hexagonalcityarea"."created",
   "instanalysis_hexagonalcityarea"."modified",
   "instanalysis_hexagonalcityarea"."geom",
   "instanalysis_hexagonalcityarea"."city_id",
   COUNT(
   CASE
      WHEN
         "instanalysis_publication"."id" IN 
         (
            SELECT
               U0."id" 
            FROM
               "instanalysis_publication" U0 
               INNER JOIN
                  "instanalysis_instagramlocation" U1 
                  ON (U0."location_id" = U1."id") 
               INNER JOIN
                  "instanalysis_spot" U2 
                  ON (U1."spot_id" = U2."id") 
               INNER JOIN
                  "instanalysis_city" U3 
                  ON (U2."city_id" = U3."id") 
            WHERE
               (
                  U3."name" = Durban 
                  AND U0."publication_date" >= 2016 - 12 - 01 00:00:00 + 01:00 
                  AND U0."publication_date" <= 2016 - 12 - 11 00:00:00 + 01:00
               )
         )
      THEN
         1 
      ELSE
         NULL 
   END
) AS "instagram_count", COUNT(
   CASE
      WHEN
         "instanalysis_twitterpublication"."id" IN 
         (
            SELECT
               U0."id" 
            FROM
               "instanalysis_twitterpublication" U0 
               INNER JOIN
                  "instanalysis_twitterlocation" U1 
                  ON (U0."location_id" = U1."id") 
               INNER JOIN
                  "instanalysis_spot" U2 
                  ON (U1."spot_id" = U2."id") 
               INNER JOIN
                  "instanalysis_city" U3 
                  ON (U2."city_id" = U3."id") 
            WHERE
               (
                  U3."name" = Durban 
                  AND U0."publication_date" >= 2016 - 12 - 01 00:00:00 + 01:00 
                  AND U0."publication_date" <= 2016 - 12 - 11 00:00:00 + 01:00
               )
         )
      THEN
         1 
      ELSE
         NULL 
   END
) AS "twitter_count" 
FROM
   "instanalysis_hexagonalcityarea" 
   LEFT OUTER JOIN
      "instanalysis_publication" 
      ON ("instanalysis_hexagonalcityarea"."id" = "instanalysis_publication"."hexagon_id") 
   LEFT OUTER JOIN
      "instanalysis_twitterpublication" 
      ON ("instanalysis_hexagonalcityarea"."id" = "instanalysis_twitterpublication"."hexagon_id") 
WHERE
   "instanalysis_hexagonalcityarea"."city_id" = 7 
GROUP BY
   "instanalysis_hexagonalcityarea"."id" 
HAVING
(COUNT(
   CASE
      WHEN
         "instanalysis_publication"."id" IN 
         (
            SELECT
               U0."id" 
            FROM
               "instanalysis_publication" U0 
               INNER JOIN
                  "instanalysis_instagramlocation" U1 
                  ON (U0."location_id" = U1."id") 
               INNER JOIN
                  "instanalysis_spot" U2 
                  ON (U1."spot_id" = U2."id") 
               INNER JOIN
                  "instanalysis_city" U3 
                  ON (U2."city_id" = U3."id") 
            WHERE
               (
                  U3."name" = Durban 
                  AND U0."publication_date" >= 2016 - 12 - 01 00:00:00 + 01:00 
                  AND U0."publication_date" <= 2016 - 12 - 11 00:00:00 + 01:00
               )
         )
      THEN
         1 
      ELSE
         NULL 
   END
) > 0 
   AND COUNT(
   CASE
      WHEN
         "instanalysis_twitterpublication"."id" IN 
         (
            SELECT
               U0."id" 
            FROM
               "instanalysis_twitterpublication" U0 
               INNER JOIN
                  "instanalysis_twitterlocation" U1 
                  ON (U0."location_id" = U1."id") 
               INNER JOIN
                  "instanalysis_spot" U2 
                  ON (U1."spot_id" = U2."id") 
               INNER JOIN
                  "instanalysis_city" U3 
                  ON (U2."city_id" = U3."id") 
            WHERE
               (
                  U3."name" = Durban 
                  AND U0."publication_date" >= 2016 - 12 - 01 00:00:00 + 01:00 
                  AND U0."publication_date" <= 2016 - 12 - 11 00:00:00 + 01:00
               )
         )
      THEN
         1 
      ELSE
         NULL 
   END
) > 0)

This is much faster; see the EXPLAIN ANALYZE output:

GroupAggregate  (cost=1.14..743590.08 rows=3300 width=184) (actual time=5186.606..46907.530 rows=334 loops=1)
  Group Key: instanalysis_hexagonalcityarea.id
  Filter: ((count(CASE WHEN (hashed SubPlan 3) THEN 1 ELSE NULL::integer END) > 0) AND (count(CASE WHEN (hashed SubPlan 4) THEN 1 ELSE NULL::integer END) > 0))
  Rows Removed by Filter: 2966
  ->  Merge Left Join  (cost=1.14..320194.96 rows=7166797 width=184) (actual time=4851.792..17369.232 rows=70436610 loops=1)
        Merge Cond: (instanalysis_hexagonalcityarea.id = instanalysis_publication.hexagon_id)
        ->  Merge Left Join  (cost=0.71..21686.40 rows=49328 width=180) (actual time=109.033..164.451 rows=30857 loops=1)
              Merge Cond: (instanalysis_hexagonalcityarea.id = instanalysis_twitterpublication.hexagon_id)
              ->  Index Scan using instanalysis_hexagonalcityarea_pkey on instanalysis_hexagonalcityarea  (cost=0.29..591.47 rows=3300 width=176) (actual time=22.783..23.878 rows=3300 loops=1)
                    Filter: (city_id = 7)
                    Rows Removed by Filter: 7282
              ->  Index Scan using instanalysis_twitterpublication_5c78aecb on instanalysis_twitterpublication  (cost=0.42..64392.25 rows=504291 width=8) (actual time=0.018..111.677 rows=170305 loops=1)
        ->  Materialize  (cost=0.43..501402.61 rows=3754731 width=8) (actual time=0.011..6788.670 rows=71922153 loops=1)
              ->  Index Scan using instanalysis_publication_5c78aecb on instanalysis_publication  (cost=0.43..492015.78 rows=3754731 width=8) (actual time=0.005..4034.838 rows=1778030 loops=1)
  SubPlan 1
    ->  Nested Loop  (cost=0.72..105061.24 rows=27624 width=4) (actual time=0.326..74.024 rows=21824 loops=1)
          ->  Nested Loop  (cost=0.29..620.11 rows=2767 width=4) (actual time=0.024..2.915 rows=3374 loops=1)
                ->  Nested Loop  (cost=0.00..143.13 rows=504 width=4) (actual time=0.016..0.618 rows=829 loops=1)
                      Join Filter: (u2.city_id = u3.id)
                      Rows Removed by Join Filter: 3350
                      ->  Seq Scan on instanalysis_city u3  (cost=0.00..1.10 rows=1 width=4) (actual time=0.004..0.006 rows=1 loops=1)
                            Filter: ((name)::text = 'Durban'::text)
                            Rows Removed by Filter: 7
                      ->  Seq Scan on instanalysis_spot u2  (cost=0.00..89.79 rows=4179 width=8) (actual time=0.001..0.242 rows=4179 loops=1)
                ->  Index Scan using instanalysis_instagramlocation_e72b53d4 on instanalysis_instagramlocation u1  (cost=0.29..0.89 rows=6 width=8) (actual time=0.001..0.002 rows=4 loops=829)
                      Index Cond: (spot_id = u2.id)
          ->  Index Scan using instanalysis_publication_e274a5da on instanalysis_publication u0  (cost=0.43..37.45 rows=30 width=8) (actual time=0.006..0.021 rows=6 loops=3374)
                Index Cond: (location_id = u1.id)
                Filter: ((publication_date >= '2016-11-30 23:00:00+00'::timestamp with time zone) AND (publication_date <= '2016-12-10 23:00:00+00'::timestamp with time zone))
                Rows Removed by Filter: 80
  SubPlan 2
    ->  Hash Join  (cost=2595.62..25893.51 rows=9013 width=4) (actual time=22.511..73.141 rows=6220 loops=1)
          Hash Cond: (u0_1.location_id = u1_1.id)
          ->  Seq Scan on instanalysis_twitterpublication u0_1  (cost=0.00..22927.36 rows=74772 width=8) (actual time=15.212..59.628 rows=75775 loops=1)
                Filter: ((publication_date >= '2016-11-30 23:00:00+00'::timestamp with time zone) AND (publication_date <= '2016-12-10 23:00:00+00'::timestamp with time zone))
                Rows Removed by Filter: 428516
          ->  Hash  (cost=2348.24..2348.24 rows=19790 width=4) (actual time=6.538..6.538 rows=15589 loops=1)
                Buckets: 32768  Batches: 1  Memory Usage: 805kB
                ->  Nested Loop  (cost=0.70..2348.24 rows=19790 width=4) (actual time=0.023..5.052 rows=15589 loops=1)
                      ->  Nested Loop  (cost=0.28..39.28 rows=504 width=4) (actual time=0.015..0.186 rows=829 loops=1)
                            ->  Seq Scan on instanalysis_city u3_1  (cost=0.00..1.10 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=1)
                                  Filter: ((name)::text = 'Durban'::text)
                                  Rows Removed by Filter: 7
                            ->  Index Scan using instanalysis_spot_c7141997 on instanalysis_spot u2_1  (cost=0.28..33.14 rows=504 width=8) (actual time=0.010..0.124 rows=829 loops=1)
                                  Index Cond: (city_id = u3_1.id)
                      ->  Index Scan using instanalysis_twitterlocation_e72b53d4 on instanalysis_twitterlocation u1_1  (cost=0.42..3.93 rows=65 width=8) (actual time=0.001..0.004 rows=19 loops=829)
                            Index Cond: (spot_id = u2_1.id)
  SubPlan 3
    ->  Nested Loop  (cost=0.72..105061.24 rows=27624 width=4) (actual time=0.348..80.863 rows=21824 loops=1)
          ->  Nested Loop  (cost=0.29..620.11 rows=2767 width=4) (actual time=0.028..3.507 rows=3374 loops=1)
                ->  Nested Loop  (cost=0.00..143.13 rows=504 width=4) (actual time=0.016..0.646 rows=829 loops=1)
                      Join Filter: (u2_2.city_id = u3_2.id)
                      Rows Removed by Join Filter: 3350
                      ->  Seq Scan on instanalysis_city u3_2  (cost=0.00..1.10 rows=1 width=4) (actual time=0.003..0.004 rows=1 loops=1)
                            Filter: ((name)::text = 'Durban'::text)
                            Rows Removed by Filter: 7
                      ->  Seq Scan on instanalysis_spot u2_2  (cost=0.00..89.79 rows=4179 width=8) (actual time=0.001..0.276 rows=4179 loops=1)
                ->  Index Scan using instanalysis_instagramlocation_e72b53d4 on instanalysis_instagramlocation u1_2  (cost=0.29..0.89 rows=6 width=8) (actual time=0.001..0.003 rows=4 loops=829)
                      Index Cond: (spot_id = u2_2.id)
          ->  Index Scan using instanalysis_publication_e274a5da on instanalysis_publication u0_2  (cost=0.43..37.45 rows=30 width=8) (actual time=0.007..0.022 rows=6 loops=3374)
                Index Cond: (location_id = u1_2.id)
                Filter: ((publication_date >= '2016-11-30 23:00:00+00'::timestamp with time zone) AND (publication_date <= '2016-12-10 23:00:00+00'::timestamp with time zone))
                Rows Removed by Filter: 80
  SubPlan 4
    ->  Hash Join  (cost=2595.62..25893.51 rows=9013 width=4) (actual time=41.392..92.680 rows=6220 loops=1)
          Hash Cond: (u0_3.location_id = u1_3.id)
          ->  Seq Scan on instanalysis_twitterpublication u0_3  (cost=0.00..22927.36 rows=74772 width=8) (actual time=32.641..78.020 rows=75775 loops=1)
                Filter: ((publication_date >= '2016-11-30 23:00:00+00'::timestamp with time zone) AND (publication_date <= '2016-12-10 23:00:00+00'::timestamp with time zone))
                Rows Removed by Filter: 428516
          ->  Hash  (cost=2348.24..2348.24 rows=19790 width=4) (actual time=7.907..7.907 rows=15589 loops=1)
                Buckets: 32768  Batches: 1  Memory Usage: 805kB
                ->  Nested Loop  (cost=0.70..2348.24 rows=19790 width=4) (actual time=0.044..6.136 rows=15589 loops=1)
                      ->  Nested Loop  (cost=0.28..39.28 rows=504 width=4) (actual time=0.026..0.220 rows=829 loops=1)
                            ->  Seq Scan on instanalysis_city u3_3  (cost=0.00..1.10 rows=1 width=4) (actual time=0.006..0.008 rows=1 loops=1)
                                  Filter: ((name)::text = 'Durban'::text)
                                  Rows Removed by Filter: 7
                            ->  Index Scan using instanalysis_spot_c7141997 on instanalysis_spot u2_3  (cost=0.28..33.14 rows=504 width=8) (actual time=0.016..0.135 rows=829 loops=1)
                                  Index Cond: (city_id = u3_3.id)
                      ->  Index Scan using instanalysis_twitterlocation_e72b53d4 on instanalysis_twitterlocation u1_3  (cost=0.42..3.93 rows=65 width=8) (actual time=0.001..0.005 rows=19 loops=829)
                            Index Cond: (spot_id = u2_3.id)
Planning time: 50.735 ms
Execution time: 46908.482 ms

The problem is that I don't get what I want; it seems to be counting more publications than expected. The publications are filtered beforehand by date, and I want to count only how many of those filtered publications fall in each hexagon, but it seems to count all the publications in each hexagon, as if the When clause weren't working.
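One detail worth noting from the plan above: the Merge Left Join produces roughly 70 million rows. With both publication tables LEFT JOINed onto the hexagons in a single query, each hexagon's Instagram rows get paired with each of its Twitter rows, so a plain COUNT(CASE ...) counts every publication multiple times. A minimal in-memory sketch with hypothetical data:

```python
# Minimal in-memory sketch (hypothetical data) of the row multiplication:
# each hexagon row is paired with every matching Instagram row AND every
# matching Twitter row, i.e. the cross product of the two sets.
instagram_rows = ["i1", "i2", "i3"]  # 3 Instagram publications in one hexagon
twitter_rows = ["t1", "t2"]          # 2 Twitter publications in the same hexagon

# The double LEFT JOIN yields one row per (instagram, twitter) pair: 3 * 2 = 6.
joined = [(i, t) for i in instagram_rows for t in twitter_rows]

# COUNT(CASE WHEN <instagram matches> THEN 1 END) sees each Instagram id
# twice, so it reports 6 instead of 3; counting DISTINCT ids would not.
inflated = sum(1 for i, t in joined)
distinct = len({i for i, t in joined})
print(len(joined), inflated, distinct)  # 6 6 3
```

If both counts must stay in one query, counting distinct publication ids per source avoids the inflation; otherwise, computing each count in its own query sidesteps the join blow-up entirely.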

Thanks for your help.

The biggest reason it is slow is the subqueries, i.e. for every record in the hex-area table the DB server issues another query to count the Instagram/Twitter records matching its ids. Even after the update, it is still doing essentially the same thing.

How to fix it: run an aggregate query. This way, the DB server can run linearly through the list of ids only once, which is probably orders of magnitude more efficient. Example:

from django.db.models import Count

# assuming "instagram_publications" is the related_name
# of the corresponding Instagram publication model
instacounts = HexagonalCityArea.objects.filter(
    city=city,
).filter(
    instagram_publications__publication_date__lte=end_date,
).filter(
    instagram_publications__publication_date__gte=start_date,
).aggregate(Count('instagram_publications'))
