
A large database or several smaller?

After reading a lot about this and similar questions, I am still not clear about the following case.

I have a schema like this in one MySQL database, where I store the probabilities of matches for more than 10 sports, by type of result. It is intended for an application that shows the odds for each sport on different pages; I will never mix sports on the same page:

Design 1: a single database

SPORT
id
name

TEAM
id
sportId
name
birth

MATCHES
id
sportId
teamId_1
teamId_2
result
date

PROBABILITIES
id
matchId
type
percentage

(The probabilities table is very long, almost a billion rows, and will grow over time.)

All necessary fields are correctly indexed. To see all the probabilities for matches that have no result yet in football (sport id = 1), I would run the following query:

SELECT s.name, t1.name as nameTeam1, t2.name as nameTeam2, t1.birth as birthTeam1, t2.birth as birthTeam2, m.date, p.type, p.percentage
FROM matches m
INNER JOIN team t1 ON t1.id = m.teamId_1
INNER JOIN team t2 ON t2.id = m.teamId_2
INNER JOIN sport s ON s.id = m.sportId
INNER JOIN probabilities p ON p.matchId = m.id
WHERE result IS NULL
AND s.id = 1

This database design is great because it allows me to work comfortably with an ORM like Prisma. But for my team the most important thing is speed and performance.

Knowing this, is it a good idea to do it this way, or would it be better to separate the tables into several databases?

Design 2: one database per sport

Database Football

TEAM
id
sportId
name
birth

MATCHES
id
teamId_1
teamId_2
date

PROBABILITIES
id
matchId
type
percentage

Database Basketball

TEAM
id
sportId
name
birth

MATCHES
id
teamId_1
teamId_2
date

PROBABILITIES
id
matchId
type
percentage

The probabilities table is much smaller; in some sports it has only thousands of rows.

So if, for example, I only need the football probabilities, I run a query like this:

SELECT t1.name as nameTeam1, t2.name as nameTeam2, t1.birth as birthTeam1, t2.birth as birthTeam2, m.date, p.type, p.percentage
FROM football.matches m
INNER JOIN football.team t1 ON t1.id = m.teamId_1
INNER JOIN football.team t2 ON t2.id = m.teamId_2
INNER JOIN football.probabilities p ON p.matchId = m.id
WHERE result IS NULL

Or is there some other way to improve the speed and performance of the database, such as partitioning the probabilities table, given that we only query the most recent rows?

If you make one database per sport, you are locking the application into that decision. If you put them all together in one, you can separate them later if necessary. I doubt it will be.

"But for my team the most important thing is speed and performance."

At this early stage the most important thing is getting something working so you can use it and discover what it actually needs to do. Then adapt the schema as you learn.

Your major performance problems won't come from whether you have one database or many, but from more pedestrian issues: indexing, bad queries, and schema design.

To that end...

  • Keep the schema simple
  • Keep the schema flexible
  • Consider a data warehouse

To the first, that means one database. Don't add the complication of maintaining multiple copies of the schema if you don't need to.

To the second, use schema migrations and keep the details of the schema out of the application code. An ORM is a good start, but also employ the Repository Pattern, Decorator Pattern, Service Pattern, and others to keep details of your tables from leaking out into your code. Then, when it inevitably comes time to change your schema, you can do so without having to rewrite all the code that uses it.

Your concerns can be solved with indexing and partitioning, probably by partitioning probabilities, but without knowing your queries I can't say on what key. For example, you might want to partition by the age of the match, since newer matches are more interesting than old ones. It's hard to say. Fortunately, partitioning can be added later.

The rest of the tables should be relatively small, so partitioning by team isn't likely to help. Poor partitioning choices can even slow things down.
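As a rough sketch of what partitioning probabilities could look like in MySQL: the table in the question has no date column, so this version partitions by ranges of matchId, on the assumption (not stated in the question) that match ids grow roughly with time. The column types, partition names, and boundaries are all made up for illustration. MySQL requires the partitioning column to be part of every unique key, which is why the primary key becomes (id, matchId).

```sql
-- Sketch only: types, partition names and boundaries are hypothetical.
CREATE TABLE probabilities (
    id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    matchId    INT UNSIGNED    NOT NULL,
    type       VARCHAR(32)     NOT NULL,
    percentage DECIMAL(5,2)    NOT NULL,
    -- The partitioning column must appear in every unique key.
    PRIMARY KEY (id, matchId),
    KEY idx_probabilities_match (matchId)
)
PARTITION BY RANGE (matchId) (
    PARTITION p_old    VALUES LESS THAN (1000000),
    PARTITION p_mid    VALUES LESS THAN (2000000),
    -- Catch-all partition for the newest matches.
    PARTITION p_recent VALUES LESS THAN MAXVALUE
);

-- Check which partitions a query touches (see the "partitions" column).
EXPLAIN SELECT type, percentage
FROM probabilities
WHERE matchId = 2500000;
```

One trade-off to weigh: partitioned InnoDB tables cannot have foreign keys, so a foreign key from probabilities.matchId to matches.id would have to be dropped first.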

Finally, what might be best for performance is to separate the statistical tables into a data warehouse optimized for big data and statistics. Do the stats there and have the application query them. This separates the runtime schema, which must have low latency and benefits from being kept small, from the statistical database, which mostly serves pre-calculated statistical queries.


Some notes on your schema.

Remove sportId from matches. It's redundant. Get it from the teams. Add a constraint to ensure both teams are playing the same sport.
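A sketch of how that could look, assuming sportId has been dropped from matches. The query is the one from the question with the sport taken from the first team. MySQL CHECK constraints cannot look at other tables, so the "same sport" rule is approximated here with a trigger; the trigger name is hypothetical, and an equivalent BEFORE UPDATE trigger would be needed to cover updates.

```sql
-- Sketch only: assumes matches no longer has a sportId column.
SELECT s.name, t1.name AS nameTeam1, t2.name AS nameTeam2,
       m.date, p.type, p.percentage
FROM matches m
INNER JOIN team t1 ON t1.id = m.teamId_1
INNER JOIN team t2 ON t2.id = m.teamId_2
INNER JOIN sport s ON s.id = t1.sportId
INNER JOIN probabilities p ON p.matchId = m.id
WHERE m.result IS NULL
  AND t1.sportId = 1;

-- Reject a match whose two teams belong to different sports.
-- DELIMITER is a mysql command-line client directive, not SQL.
DELIMITER //
CREATE TRIGGER matches_same_sport_bi
BEFORE INSERT ON matches
FOR EACH ROW
BEGIN
    IF (SELECT sportId FROM team WHERE id = NEW.teamId_1)
       <> (SELECT sportId FROM team WHERE id = NEW.teamId_2) THEN
        SIGNAL SQLSTATE '45000'
            SET MESSAGE_TEXT = 'Both teams must play the same sport';
    END IF;
END//
DELIMITER ;
```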

Don't name a column date. First, it's a keyword. Second, date of what? What if there's another date associated with the match? Third, what about the time of the match? Make it specific: scheduled_at. Use a timestamp type.
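For instance, the rename could be done with a single statement. This is a sketch that assumes the existing column is a DATE or DATETIME; existing values are converted in place.

```sql
-- Sketch only: rename matches.date and give it a time component.
-- NULL DEFAULT NULL avoids MySQL's implicit TIMESTAMP defaults.
ALTER TABLE matches
    CHANGE COLUMN `date` scheduled_at TIMESTAMP NULL DEFAULT NULL;
```

MySQL's TIMESTAMP type only covers dates up to January 2038; DATETIME is the usual alternative if that range is a concern.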

Result should be its own table. You're going to want to store a lot of information about the result of the match.
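A minimal sketch of such a table. The table name, the score columns, and finished_at are all hypothetical, and the foreign key assumes matches.id is an INT UNSIGNED. With this layout, "matches without a result" becomes an anti-join instead of a NULL check.

```sql
-- Sketch only: table and column names are made up for illustration.
CREATE TABLE match_result (
    matchId     INT UNSIGNED      NOT NULL,
    score_1     SMALLINT UNSIGNED NOT NULL,
    score_2     SMALLINT UNSIGNED NOT NULL,
    finished_at TIMESTAMP NULL DEFAULT NULL,
    PRIMARY KEY (matchId),
    FOREIGN KEY (matchId) REFERENCES matches (id)
);

-- Matches that have no result row yet.
SELECT m.id
FROM matches m
LEFT JOIN match_result r ON r.matchId = m.id
WHERE r.matchId IS NULL;
```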

In MySQL, a "DATABASE" is a very lightweight thing. It makes virtually no difference to MySQL, or to your queries, whether you have one database or two. Or 20.

You might need a little bit of extra syntax to handle JOINs:

One db:

USE db;
SELECT a.x, b.y
    FROM a
    JOIN b ON ...;

Two dbs:

USE db;
SELECT a.x, b.y
    FROM db1.a AS a
    JOIN db2.b AS b  ON ...;

The performance of those two is the same.

Bottom line: do what feels good to you, the developer.
