Django table with millions of rows

Question

I have a project with 2 applications (books and reader).

The books application has a table with 4 million rows and these fields:

 book_title = models.CharField(max_length=40)
 book_description = models.CharField(max_length=400)

To avoid querying a table with 4 million rows, I am thinking of dividing it by subject: 20 models with 20 tables of 200,000 rows each (book_horror, book_drammatic, etc.).

In the "reader" application, I am thinking of adding these fields:

reader_name = models.CharField(max_length=20, blank=True)
book_subject = models.IntegerField()
book_id = models.IntegerField()

So instead of a ForeignKey, I am thinking of using an integer "book_subject" (which gives access to the appropriate table) and "book_id" (which gives access to the book in the table specified by "book_subject").

Is this a good solution to avoid querying a table with 4 million rows?

Is there an alternative solution?

Thanks ^__^

Answer 1

Like many have said, it's a bit premature to split your table up into smaller tables (horizontal partitioning or even sharding). Databases are made to handle tables of this size, so your performance problem is probably somewhere else.

Indexes are the first step, though it sounds like you've done this already. 4 million rows should be OK for the db to handle with an index.
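To make the effect of an index concrete, here is a minimal sketch using Python's built-in sqlite3 module instead of Django. The table name `books_book` and the index name are assumptions for illustration; MySQL and PostgreSQL behave along the same lines, and in a Django model `db_index=True` on a field requests this kind of index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books_book ("
    "id INTEGER PRIMARY KEY, book_title TEXT, book_description TEXT)"
)
conn.executemany(
    "INSERT INTO books_book (book_title, book_description) VALUES (?, ?)",
    ((f"title {i}", f"description {i}") for i in range(10_000)),
)


def plan(sql, params=()):
    """Return SQLite's query plan for a statement as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)  # last column is the description


query = "SELECT id FROM books_book WHERE book_title = ?"

# Without an index, the lookup has to scan the whole table.
plan_before = plan(query, ("title 42",))

# With an index on the filtered column, the lookup can use it.
conn.execute("CREATE INDEX books_book_title_idx ON books_book (book_title)")
plan_after = plan(query, ("title 42",))

print(plan_before)
print(plan_after)
```

The exact plan wording differs between SQLite versions, so compare the two printed lines rather than relying on a specific string.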

Second, check the number of queries you're running. You can do this with something like the django debug toolbar, and you'll often be surprised how many unnecessary queries are being made.
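A frequent source of unnecessary queries is running one lookup per row inside a loop. The sketch below is not the debug toolbar; it just counts statements with sqlite3's `set_trace_callback` to show the difference. The `book` and `reader` tables and the function names are assumptions. In the Django ORM, `select_related()` is the usual way to get the joined version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE book (id INTEGER PRIMARY KEY, book_title TEXT);
    CREATE TABLE reader (id INTEGER PRIMARY KEY, reader_name TEXT,
                         book_id INTEGER REFERENCES book (id));
    INSERT INTO book (id, book_title)
        VALUES (1, 'Dracula'), (2, 'Emma'), (3, 'Ulysses');
    INSERT INTO reader (reader_name, book_id)
        VALUES ('ann', 1), ('bob', 2), ('cy', 3);
""")

statements = []
conn.set_trace_callback(statements.append)  # record each SQL statement run


def titles_one_query_per_reader():
    """One query for the readers, then one more query for every reader."""
    readers = conn.execute(
        "SELECT reader_name, book_id FROM reader ORDER BY id"
    ).fetchall()
    result = []
    for name, book_id in readers:
        title = conn.execute(
            "SELECT book_title FROM book WHERE id = ?", (book_id,)
        ).fetchone()[0]
        result.append((name, title))
    return result


def titles_single_join():
    """The same data fetched with a single joined query."""
    return conn.execute(
        "SELECT r.reader_name, b.book_title "
        "FROM reader r JOIN book b ON b.id = r.book_id ORDER BY r.id"
    ).fetchall()


statements.clear()
slow = titles_one_query_per_reader()
queries_in_loop = len(statements)

statements.clear()
fast = titles_single_join()
queries_with_join = len(statements)

print(queries_in_loop, queries_with_join)
```

Both functions return the same rows; only the number of round trips to the database differs, and that gap grows with the number of readers.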

Caching is the next step: use memcached for pages or parts of pages that are the same for most users. This is where you will see your biggest performance boost for the little effort required.
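The pattern behind this is "look in the cache first, do the expensive work only on a miss". Below is a minimal sketch where a small in-process class stands in for memcached; `TTLCache`, `render_book_list`, and the key format are hypothetical names. With Django you would call `cache.get` and `cache.set` from its cache framework instead.

```python
import time


class TTLCache:
    """Tiny in-process stand-in for memcached: key -> (expiry, value)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._store[key]  # expired, treat as a miss
            return None
        return value

    def set(self, key, value, timeout):
        self._store[key] = (time.monotonic() + timeout, value)


cache = TTLCache()
stats = {"renders": 0}


def render_book_list(subject):
    """Hypothetical expensive step: database queries plus rendering."""
    stats["renders"] += 1
    return f"<ul><li>books about {subject}</li></ul>"


def cached_book_list(subject, timeout=300):
    key = f"book_list:{subject}"
    html = cache.get(key)
    if html is None:  # miss: do the work once, then store the result
        html = render_book_list(subject)
        cache.set(key, html, timeout)
    return html


first = cached_book_list("horror")
second = cached_book_list("horror")  # served from the cache
```

The timeout is the trade-off to tune: a longer one saves more work but lets readers see stale pages for longer.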

If you really, really need to split up the tables, the latest version of django (1.2 alpha) can handle sharding (e.g. multi-db), and you should be able to hand-write a horizontal partitioning solution (postgres offers an in-db way to do this). Please don't use genre to split the tables! Pick something that you won't ever, ever change and that you'll always know when making a query, like the author, divided by first letter of the surname or something. This is a lot of effort and has a number of drawbacks for a database which isn't particularly big, which is why most people here are advising against it!
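If you did go down that route, the routing rule could be as small as the sketch below. It only illustrates "a key that never changes and is always known at query time"; the function name and the `book_author_` table prefix are hypothetical, and nothing here creates or queries the tables.

```python
import string


def partition_for(surname, prefix="book_author_"):
    """Map an author's surname to a hypothetical per-letter table name."""
    first = surname.strip()[:1].lower()
    # Empty names, digits and non-ASCII initials share one catch-all table.
    if len(first) != 1 or first not in string.ascii_lowercase:
        return prefix + "other"
    return prefix + first


print(partition_for("Stoker"))
print(partition_for("Austen"))
```

Every query then has to go through this function, which is part of the maintenance cost of partitioning by hand.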

[edit]

I left out denormalisation! Put common counts, sums etc. in e.g. the author table to prevent joins on common queries. The downside is that you have to maintain it yourself (until django adds a DenormalizedField). I would look at this during development for clear, straightforward cases, or after caching has failed you, but well before sharding or horizontal partitioning.
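As a rough sketch of what "maintain it yourself" means, the example below keeps a `book_count` column on the author row and updates it in the same transaction as the insert. It uses sqlite3 rather than Django, and the table names, column names, and `add_book` helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT,
                         book_count INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE book (id INTEGER PRIMARY KEY, book_title TEXT,
                       author_id INTEGER NOT NULL REFERENCES author (id));
    INSERT INTO author (id, name) VALUES (1, 'Bram Stoker');
""")


def add_book(title, author_id):
    """Insert a book and bump the author's stored count in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO book (book_title, author_id) VALUES (?, ?)",
            (title, author_id),
        )
        conn.execute(
            "UPDATE author SET book_count = book_count + 1 WHERE id = ?",
            (author_id,),
        )


add_book("Dracula", 1)
add_book("The Jewel of Seven Stars", 1)

# Reading the stored count needs no join and no COUNT over the book table.
stored = conn.execute(
    "SELECT book_count FROM author WHERE id = 1"
).fetchone()[0]
counted = conn.execute(
    "SELECT COUNT(*) FROM book WHERE author_id = 1"
).fetchone()[0]
```

Deletes and author changes need the same care, otherwise the stored count drifts away from the real one.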

Answer 2

ForeignKey is implemented as an IntegerField in the database, so you save little to nothing at the cost of crippling your model.

Edit: And for pete's sake, keep it in one table and use indexes as appropriate.
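The point can be checked at the database level. In this sketch (sqlite3, with table names chosen as assumptions to resemble what the two applications might produce), the foreign key column is stored as a plain integer, and what the constraint adds is protection against dangling references, which a bare `book_id = models.IntegerField()` would not give you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    CREATE TABLE books_book (id INTEGER PRIMARY KEY, book_title TEXT);
    CREATE TABLE reader_reader (
        id INTEGER PRIMARY KEY,
        reader_name TEXT,
        book_id INTEGER NOT NULL REFERENCES books_book (id)
    );
    INSERT INTO books_book (id, book_title) VALUES (1, 'Dracula');
    INSERT INTO reader_reader (reader_name, book_id) VALUES ('ann', 1);
""")

# The foreign key column has the same storage type as any integer column.
columns = {
    row[1]: row[2] for row in conn.execute("PRAGMA table_info(reader_reader)")
}
book_id_type = columns["book_id"]

# What the constraint adds: a reference to a missing book is rejected.
try:
    conn.execute(
        "INSERT INTO reader_reader (reader_name, book_id) VALUES ('bob', 999)"
    )
    dangling_rejected = False
except sqlite3.IntegrityError:
    dangling_rejected = True

print(book_id_type, dangling_rejected)
```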

Answer 3

Are you having performance problems? If so, you might need to add a few indexes.

One way to get an idea where an index would help is by looking at your db server's query log (instructions here if you're on MySQL).

If you're not having performance problems, then just go with it. Databases are made to handle millions of records, and django is pretty good at generating sensible queries.

Answer 4

A common approach to this type of problem is sharding. Unfortunately it's mostly up to the ORM to implement it (Hibernate does it wonderfully) and Django does not support this. However, I'm not sure 4 million rows is really all that bad. Your queries should still be entirely manageable.

Perhaps you should look into caching with something like memcached. Django supports this quite well.

Answer 5

You haven't mentioned which database you're using. Some databases, like MySQL and PostgreSQL, have extremely conservative settings out of the box, which are basically unusable for anything except tiny databases on tiny servers.

If you tell us which database you're using, what hardware it's running on, and whether that hardware is shared with other applications (is it also serving the web application, for example), then we may be able to give you some specific tuning advice.

For example, with MySQL, you will probably need to tune the InnoDB settings; for PostgreSQL, you'll need to alter shared_buffers and a number of other settings.

Answer 6

I'm not familiar with Django, but I have a general understanding of databases.

When you have large databases, it's pretty normal to index your database. That way, retrieving data should be pretty quick.

When it comes to associating a book with a reader, you should create another table that links readers to books.

It's not a bad idea to divide the books into subjects. But I'm not sure what you mean by having 20 applications.

Answer 7

You can use a server-side datatable. If you can implement a server-side datatable, you will be able to handle more than 4 million records in less than a second.

7 answers

Answer 1: 12 (accepted), 2010-01-12 20:20:11
Answer 2: 10, 2010-01-12 18:50:00
Answer 3: 1, 2010-01-12 19:10:05
Answer 4: 1, 2010-01-12 19:16:38
Answer 5: 1, 2012-03-05 11:35:02
Answer 6: 0, 2010-01-12 19:02:51
Answer 7: 0, 2022-09-19 06:41:35
