Converting Django QuerySet to pandas DataFrame

I want to convert a Django QuerySet to a pandas DataFrame as follows:

import pandas as pd

qs = SomeModel.objects.select_related().filter(date__year=2012)
q = qs.values('date', 'OtherField')
df = pd.DataFrame.from_records(q)

It works, but is there a more efficient way?

import pandas as pd
import datetime
from myapp.models import BlogPost

df = pd.DataFrame(list(BlogPost.objects.all().values()))
df = pd.DataFrame(list(BlogPost.objects.filter(date__gte=datetime.datetime(2012, 5, 1)).values()))

# limit which fields
df = pd.DataFrame(list(BlogPost.objects.all().values('author', 'date', 'slug')))

The above is how I do the same thing. The most useful addition is specifying which fields you are interested in. If you only need a subset of the available fields, I imagine this would give a performance boost.

Converting the queryset via values_list() is more memory efficient than via values(). values() yields one dict (key: value pairs) per row, whereas values_list() yields one plain tuple per row (pure data). In my test this saved about 50% of the memory; you only need to pass the column names when you call pd.DataFrame().

Method 1:
    queryset = models.xxx.objects.values("A", "B", "C", "D")
    df = pd.DataFrame(list(queryset))  # consumes a lot of memory
    # df = pd.DataFrame.from_records(queryset)  # works, but memory usage barely changes

Method 2:
    queryset = models.xxx.objects.values_list("A", "B", "C", "D")
    df = pd.DataFrame(list(queryset), columns=["A", "B", "C", "D"])  # saved about 50% memory in my test
    # df = pd.DataFrame.from_records(queryset, columns=["A", "B", "C", "D"])  # did not work for me: it crashed because the data is a queryset, not a list

I tested this on my project with more than 1 million rows of data; the peak memory dropped from 2 GB to 1 GB.
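The dict-versus-tuple difference can be checked without a database. The following is only a rough sketch using plain Python rows as stand-ins for what values() and values_list() return; the column names, row count, and helper names are hypothetical, and the exact ratio will depend on your Python version and field types.

```python
import tracemalloc

COLUMNS = ("A", "B", "C", "D")  # hypothetical field names
N_ROWS = 50_000                 # arbitrary size for the comparison


def peak_bytes(build_rows):
    """Return the peak memory (in bytes) traced while build_rows() runs."""
    tracemalloc.start()
    rows = build_rows()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del rows
    return peak


def rows_as_dicts():
    # Same shape as list(queryset.values("A", "B", "C", "D"))
    return [dict(zip(COLUMNS, (i, i + 1, i + 2, i + 3))) for i in range(N_ROWS)]


def rows_as_tuples():
    # Same shape as list(queryset.values_list("A", "B", "C", "D"))
    return [(i, i + 1, i + 2, i + 3) for i in range(N_ROWS)]


dict_peak = peak_bytes(rows_as_dicts)
tuple_peak = peak_bytes(rows_as_tuples)
print(f"dict rows:  {dict_peak / 1e6:.1f} MB")
print(f"tuple rows: {tuple_peak / 1e6:.1f} MB")
```

This only measures the intermediate Python list, not the DataFrame built from it, so treat the numbers as an illustration of why Method 2 is lighter rather than as a benchmark.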

django-pandas solves this rather neatly: https://github.com/chrisdev/django-pandas/

From the README:

from django.db import models

class MyModel(models.Model):
    full_name = models.CharField(max_length=25)
    age = models.IntegerField()
    department = models.CharField(max_length=3)
    wage = models.FloatField()

from django_pandas.io import read_frame
qs = MyModel.objects.all()
df = read_frame(qs)

From the Django perspective (I'm not familiar with pandas) this is fine. My only concern is that if you have a very large number of records, you may run into memory problems. If that were the case, something along the lines of a memory-efficient queryset iterator would be necessary. (Such a snippet might need some rewriting to allow for your smart use of .values().)
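One way to act on that concern is to consume the rows in chunks instead of materializing the whole list first. This is a minimal sketch, not the snippet referred to above: frame_from_rows and chunk_size are hypothetical names, and the generator stands in for something like qs.values_list('date', 'OtherField').iterator() so that the example runs without a database.

```python
from itertools import islice

import pandas as pd


def frame_from_rows(rows, columns, chunk_size=10_000):
    """Build a DataFrame from an iterable of tuples, one chunk at a time."""
    rows = iter(rows)
    frames = []
    while True:
        # Pull at most chunk_size rows; only this chunk lives as Python tuples.
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            break
        frames.append(pd.DataFrame.from_records(chunk, columns=columns))
    if not frames:
        return pd.DataFrame(columns=columns)
    return pd.concat(frames, ignore_index=True)


# Stand-in for qs.values_list("date", "OtherField").iterator()
fake_rows = ((f"2012-01-{day:02d}", day * 10) for day in range(1, 26))
df = frame_from_rows(fake_rows, columns=["date", "OtherField"], chunk_size=10)
print(df.shape)
```

The final pd.concat still holds every row, so this mainly helps when the intermediate list of Python objects, rather than the DataFrame itself, is what exhausts memory.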

You may be able to use model_to_dict:

import pandas as pd
from django.forms.models import model_to_dict

# PalletsManag is a model from my own app; import it from wherever it is defined
pallobjs = [model_to_dict(pallobj) for pallobj in PalletsManag.objects.filter(estado='APTO_PARA_VENTA')]
df = pd.DataFrame(pallobjs)
df.head()
