Django filtering a large dataset takes too long

Question

I know there have been a lot of questions on this subject, but I couldn't find one that matches my case. My problem is rather simple.

In my influencer app, I have a Note model which contains about 30 fields:

from django.db import models


class Note(models.Model):
    desc = models.TextField()
    likeCount = models.PositiveIntegerField()
    commentCount = models.PositiveIntegerField()
    ...  # remaining fields elided in the original post
    tags = models.ManyToManyField(Tag)
    postTs = models.DateTimeField(null=True)

There are more than 1 million Notes in my PostgreSQL database, which is hosted on AWS RDS.

Now when I execute the following code:

notes = (
    Note.objects.filter(desc__icontains='some word')
                .values("likeCount", "collectCount", "shareCount", "commentCount", "postTs")[:10]
)
print(len(notes))  # Output: 10

it takes around 7 seconds.

The resulting SQL query is:

SELECT "influencer_note"."likeCount",
       "influencer_note"."collectCount",
       "influencer_note"."shareCount",
       "influencer_note"."commentCount",
       "influencer_note"."postTs"
  FROM "influencer_note"
 WHERE UPPER("influencer_note"."desc"::text) LIKE UPPER('%some word%')
 LIMIT 10

I think I have done pretty much everything to optimize the query (selecting only the necessary fields and limiting the number of rows; 10 is obviously a small number), but it still takes an abnormal amount of time.

What are the possible causes of this problem, and how can I optimize it further?

Ultimately, I need to build a chart from the filtered queryset, which is why I need a solution other than pagination or LIMIT.

Thank you in advance.

Answer 1

It's very likely that icontains (rendered here as UPPER(...) LIKE UPPER(...)) is the expensive operation that takes most of the query evaluation time. A pattern with a leading wildcard cannot use an ordinary B-tree index, so PostgreSQL has to read rows one by one until it has found 10 matches. I'm not sure you can optimize much further with the Django ORM alone, but you could try PostgreSQL full-text search, which works on search vectors.
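As a rough illustration of the full-text search route, the sketch below swaps `icontains` for Django's `search` lookup. The helper name `search_notes` is hypothetical, and the sketch assumes `'django.contrib.postgres'` is listed in `INSTALLED_APPS` and that the database is PostgreSQL. Note that full-text search matches whole words (lexemes) rather than arbitrary substrings, so the results are not identical to `icontains`.

```python
def search_notes(queryset, term, limit=10):
    """Filter notes with PostgreSQL full-text search instead of icontains.

    Assumption: 'django.contrib.postgres' is in INSTALLED_APPS, which
    enables the `search` lookup on text fields.
    """
    return (
        # desc__search compares a search vector built from "desc" with the term
        queryset.filter(desc__search=term)
        .values("likeCount", "collectCount", "shareCount", "commentCount", "postTs")[:limit]
    )
```

It would be called as `search_notes(Note.objects.all(), "some word")`. Without a matching GIN index on the search vector, PostgreSQL still has to compute the vector for every row it inspects, so the lookup alone may not be enough to make the query fast.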

Another option is to use a more suitable tool such as ElasticSearch. You can read an introductory guide here.

Answer 2

The query you show could be accelerated by the following index:

create extension pg_trgm;
create index on influencer_note using gin (UPPER("desc"::text) gin_trgm_ops);
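To give an intuition for why a trigram index helps a `LIKE '%...%'` search, here is a toy sketch in plain Python. It is not how pg_trgm is implemented (the real extension extracts trigrams differently, for example by padding words), and the names `trigrams` and `TrigramIndex` are made up for illustration. The idea it shows is the general one: look up the rows that contain every 3-character piece of the search term, then recheck only those candidates.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of 3-character substrings of the upper-cased text."""
    upper = text.upper()
    return {upper[i:i + 3] for i in range(len(upper) - 2)}


class TrigramIndex:
    """Toy inverted index mapping each trigram to the row ids containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.rows = {}

    def add(self, row_id, text):
        self.rows[row_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(row_id)

    def icontains(self, needle):
        grams = trigrams(needle)
        if not grams:
            # Needle shorter than 3 characters: nothing to look up, scan all rows.
            candidates = set(self.rows)
        else:
            # A matching row must contain every trigram of the needle.
            candidates = set.intersection(
                *(self.postings.get(gram, set()) for gram in grams)
            )
        # Recheck candidates: sharing all trigrams does not guarantee a match.
        wanted = needle.upper()
        return sorted(r for r in candidates if wanted in self.rows[r].upper())


index = TrigramIndex()
index.add(1, "Some word in a note")
index.add(2, "Another note entirely")
index.add(3, "awesome WORDS here")
print(index.icontains("some word"))  # prints [1, 3]
```

Only rows 1 and 3 are ever compared against the full search term; row 2 is ruled out by the index lookup alone, which is what saves time on a table with a million rows.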

Why Django injects the UPPER calls into the query here, rather than doing the sensible thing and using ILIKE, is a mystery to me.
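To check whether the index from this answer is actually picked up, one option is to ask PostgreSQL for the query plan through Django's `QuerySet.explain()`. The helper name `explain_icontains` below is hypothetical; the sketch assumes a PostgreSQL backend, where the `analyze=True` option runs the query and reports real timings.

```python
def explain_icontains(queryset, term, limit=10):
    """Return the EXPLAIN ANALYZE output (a string) for the slow query."""
    slow = (
        queryset.filter(desc__icontains=term)
        .values("likeCount", "collectCount", "shareCount", "commentCount", "postTs")[:limit]
    )
    # analyze=True is a PostgreSQL EXPLAIN option: the query is executed
    # and the plan includes actual row counts and timings.
    return slow.explain(analyze=True)
```

Calling `print(explain_icontains(Note.objects.all(), "some word"))` before and after creating the index should show the difference: a plan with a "Seq Scan" on influencer_note means the table is being read row by row, while a "Bitmap Index Scan" suggests the trigram index is in use.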

Answer 1
2 2020-07-02 13:06:08

Answer 2
1 2020-07-02 13:39:41
