
How do you split a Django queryset without evaluating it?

I am dealing with a queryset of more than 5 million items (for batch ML purposes), and I need to split it so I can run multithreaded operations. I want to do this without evaluating the queryset: I only ever need to access each item once, so I don't want the items cached, which is what evaluation causes.

Is it possible to select the items into one queryset and split it without evaluating it? Or will I have to query for multiple querysets using limits ([:size]) to achieve this behaviour?

NB: I am aware that an iterable can be used to cycle through a queryset without evaluating it, but my question is about how I can split a queryset (if possible) and then iterate over each of the resulting querysets.

Django provides a few classes that help you manage paginated data, that is, data that's split across several pages, with "Previous/Next" links:

from django.core.paginator import Paginator

object_list = MyModel.objects.all()
paginator = Paginator(object_list, 10)  # 10 objects per page; you can choose any other value

# page_range is an attribute, not a method: a 1-based range of page numbers, e.g. 1, 2, 3, 4
for i in paginator.page_range:
    data = iter(paginator.get_page(i))
    # use data

If your Django version is 1.11 or earlier (1.10, 1.9, and so on), use paginator.page(page_no), but be aware that it raises an InvalidPage exception when the page number is invalid or the page does not exist.

For versions <= 1.11, use the code below:

from django.core.paginator import Paginator

qs = MyModel.objects.all()
paginator = Paginator(qs, 20)

for page_no in paginator.page_range:
    current_page = paginator.page(page_no)
    current_qs = current_page.object_list

If you are using Django >= 2.0, use paginator.get_page(page_no) instead, although paginator.page(page_no) still works.

For versions >= 2.0, use the code below:

from django.core.paginator import Paginator

qs = MyModel.objects.all()
paginator = Paginator(qs, 20)

for page_no in paginator.page_range:
    current_page = paginator.get_page(page_no)
    current_qs = current_page.object_list

According to the Django documentation, the advantage of paginator.get_page(page_no) is as follows:

Return a valid page, even if the page argument isn't a number or isn't in range.

With paginator.page(page_no), by contrast, you have to handle the exception yourself if page_no is not a number or is out of range.

Passing querysets to threads is not something I would recommend. I know the sort of thing you are trying to do and why, but it's best to pass some sort of parameter set to each thread and then have the thread perform the partial query itself. Working this way, your threads are independent of the calling code.
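As a rough illustration of passing parameters rather than querysets, here is a minimal sketch in which each worker receives a pair of primary-key bounds and runs its own partial query. The helper names (pk_ranges, process_range, fake_fetch) are hypothetical, and the in-memory FAKE_TABLE stands in for the database so the snippet runs without Django; it also assumes integer primary keys.

```python
from concurrent.futures import ThreadPoolExecutor


def pk_ranges(min_pk, max_pk, parts):
    """Split the inclusive range [min_pk, max_pk] into at most `parts`
    half-open (lo, hi) ranges. Assumes max_pk >= min_pk and parts >= 1."""
    total = max_pk - min_pk + 1
    step = -(-total // parts)  # ceiling division
    return [(lo, min(lo + step, max_pk + 1))
            for lo in range(min_pk, max_pk + 1, step)]


def process_range(bounds, fetch):
    """Worker: run its own partial query and consume the rows one at a time."""
    lo, hi = bounds
    return sum(1 for _ in fetch(lo, hi))  # here we only count the rows


# Stand-in for the database (an assumption, so the sketch runs anywhere).
# In a Django project, fetch could instead be something like:
#   lambda lo, hi: MyModel.objects.filter(pk__gte=lo, pk__lt=hi).iterator()
FAKE_TABLE = list(range(1, 101))  # pretend primary keys 1..100


def fake_fetch(lo, hi):
    return (pk for pk in FAKE_TABLE if lo <= pk < hi)


ranges = pk_ranges(min_pk=1, max_pk=100, parts=4)
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(lambda b: process_range(b, fake_fetch), ranges))
```

Each thread only ever sees two integers, so nothing is shared between the caller and the workers. If the primary keys have large gaps, the ranges will hold uneven numbers of rows.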

On a different note, if you are trying to use threads as a workaround for lag caused by heavy database querying, you might find transaction management a better route. This link has some useful tips. I use this instead of threads.

Yes, you can, as shown in this gist.

Per the updated answer:根据更新的答案:

import gc

from django.core.exceptions import ObjectDoesNotExist


def queryset_iterator(queryset, chunk_size=1000):
    """
    Iterate over a Django queryset ordered by the primary key.

    This method loads a maximum of chunk_size (default: 1000) rows into
    memory at the same time, while Django would normally load all rows into
    memory. Using the iterator() method only causes it to not preload all
    the classes.

    Note that the implementation of the iterator does not support ordered
    querysets.
    """
    try:
        last_pk = queryset.order_by('-pk')[:1].get().pk
    except ObjectDoesNotExist:
        return  # empty queryset: nothing to yield

    pk = 0  # starting point; relies on positive integer primary keys
    queryset = queryset.order_by('pk')
    while pk < last_pk:
        # Fetch the next chunk of rows after the last primary key seen
        for row in queryset.filter(pk__gt=pk)[:chunk_size]:
            pk = row.pk
            yield row
        gc.collect()
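Since the question mentions batch ML work, rows coming out of a generator like the one above can also be grouped into fixed-size batches without materialising everything. The following is only a sketch: batches is a hypothetical helper, and a plain range stands in for the queryset so the snippet runs without Django.

```python
from itertools import islice


def batches(rows, batch_size):
    """Yield lists of up to batch_size items from any iterable, lazily."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Stand-in data (an assumption). In a Django project, `rows` could be
# queryset_iterator(MyModel.objects.all()) from the answer above.
rows = (n for n in range(10))
result = [batch for batch in batches(rows, 4)]
```

Only one batch is held in memory at a time when the batches are consumed in a loop rather than collected into a list as done here for demonstration.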
