
Storing Scientific Data in a Relational Database

Original post: 2011-03-15 16:58:36 6 4 sql/ database/ storage

I want to store hierarchical, two-dimensional scientific datasets in a relational database (MySQL or SQLite). Each dataset contains a table of numerical data with an arbitrary number of columns. In addition, each dataset can have one or more child datasets of the same type, each associated with a given row of its table. Each dataset typically has between 1 and 100 columns and between 1 and 1,000,000 rows. The database should be able to handle many datasets (>1000), and reading and writing data should be reasonably fast.

What would be the best DB schema for storing this kind of data? Is it reasonable to have a "master" table with the names, IDs and relations of the individual datasets and, in addition, one table per dataset that contains the numerical values?

4 Answers

Is it reasonable to have a "master" table with the names, IDs and relations of individual datasets and in addition one table per dataset which contains the numerical values?

That's how I'd do it.

I'm not exactly sure how the 'arbitrary columns' part is supposed to work, because data usually isn't shaped like that. Regardless, it sounds like storing it as (row, col, val) might work nicely.
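To make the (row, col, val) idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `cell`, its column names and the helper functions are my own assumptions for illustration, not anything from the question.

```python
import sqlite3

def create_schema(conn):
    # One record per cell; the key says which dataset, row and column it is.
    conn.execute("""
        CREATE TABLE IF NOT EXISTS cell (
            dataset_id INTEGER NOT NULL,
            row_idx    INTEGER NOT NULL,
            col_idx    INTEGER NOT NULL,
            val        REAL,
            PRIMARY KEY (dataset_id, row_idx, col_idx)
        )
    """)

def store_table(conn, dataset_id, rows):
    """Store a list of rows (each a list of numbers) as one record per cell."""
    cells = [
        (dataset_id, r, c, value)
        for r, row in enumerate(rows)
        for c, value in enumerate(row)
    ]
    conn.executemany("INSERT INTO cell VALUES (?, ?, ?, ?)", cells)

def load_table(conn, dataset_id):
    """Rebuild the list of rows; assumes rows are numbered 0, 1, 2, ... without gaps."""
    table = []
    query = ("SELECT row_idx, val FROM cell WHERE dataset_id = ? "
             "ORDER BY row_idx, col_idx")
    for row_idx, val in conn.execute(query, (dataset_id,)):
        if row_idx == len(table):  # first cell of a new row
            table.append([])
        table[row_idx].append(val)
    return table

conn = sqlite3.connect(":memory:")
create_schema(conn)
store_table(conn, 1, [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])  # 3 columns
store_table(conn, 2, [[7.5]])                              # 1 column, same table
print(load_table(conn, 1))
# Searching (max, min, etc.) stays a plain SQL query:
print(conn.execute(
    "SELECT MAX(val) FROM cell WHERE dataset_id = 1 AND col_idx = 2"
).fetchone()[0])
```

The upside is that the number of columns never touches the schema. The price is volume: a dataset with 100 columns and 1,000,000 rows turns into 100 million records, so it is worth measuring before committing to this layout.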

Honestly though, if you don't need to search through it (max, min, etc.), it might be better to use some kind of flat file.

An alternative setup that might be interesting is using SQLite, with a separate database file for each dataset, plus one master file.
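A rough sketch of that one-file-per-dataset layout, again with Python's sqlite3. The file naming (`master.sqlite`, `dataset_<id>.sqlite`), the generated column names `c0, c1, ...` and the helpers are assumptions made up for the example.

```python
import os
import sqlite3
import tempfile

def open_master(directory):
    # The master file only records which datasets exist.
    master = sqlite3.connect(os.path.join(directory, "master.sqlite"))
    master.execute(
        "CREATE TABLE IF NOT EXISTS dataset (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    return master

def dataset_path(directory, dataset_id):
    # The file name is derived from the id, so the master need not store a path.
    return os.path.join(directory, f"dataset_{dataset_id}.sqlite")

def add_dataset(master, directory, name, rows):
    cur = master.execute("INSERT INTO dataset (name) VALUES (?)", (name,))
    dataset_id = cur.lastrowid
    n_cols = len(rows[0])
    col_defs = ", ".join(f"c{i} REAL" for i in range(n_cols))
    placeholders = ", ".join("?" * n_cols)
    data = sqlite3.connect(dataset_path(directory, dataset_id))
    with data:  # commits the data file on success
        data.execute(f"CREATE TABLE data ({col_defs})")
        data.executemany(f"INSERT INTO data VALUES ({placeholders})", rows)
    data.close()
    master.commit()
    return dataset_id

def read_rows(directory, dataset_id):
    data = sqlite3.connect(dataset_path(directory, dataset_id))
    try:
        return data.execute("SELECT * FROM data ORDER BY rowid").fetchall()
    finally:
        data.close()

workdir = tempfile.mkdtemp()
master = open_master(workdir)
first = add_dataset(master, workdir, "run-1", [(0.0, 1.5), (1.0, 2.5)])
second = add_dataset(master, workdir, "run-2", [(10.0,), (20.0,), (30.0,)])
print(read_rows(workdir, second))
```

Each dataset can then be copied, archived or deleted as an ordinary file, but a query that spans datasets has to open several files, so this mostly suits access patterns that touch one dataset at a time.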

Whatever you pick, how well it will work really depends on what you're going to do with the data.

You're going to end up trading off flexibility for performance, I think. You can hard-code your DB schema, which it sounds like you want to avoid but which would give you the best performance, or leave the schema to be determined at runtime and stored in a 'master' table, which increases your flexibility but reduces your ability to enforce referential integrity and set data types.

For a while, you could try both approaches, until you have enough information about which performs better for your task.
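For the runtime-schema option, a minimal sketch might look like the following (Python's sqlite3). All names here (`dataset`, `dataset_column`, `data_<id>`, `parent_row`) are hypothetical; the point is only to show where the flexibility comes from and what the database can no longer check.

```python
import sqlite3

def create_master(conn):
    conn.executescript("""
        CREATE TABLE dataset (
            id         INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            parent_id  INTEGER REFERENCES dataset(id),  -- NULL for a top-level dataset
            parent_row INTEGER                          -- parent row this child belongs to
        );
        CREATE TABLE dataset_column (
            dataset_id INTEGER NOT NULL REFERENCES dataset(id),
            position   INTEGER NOT NULL,
            name       TEXT NOT NULL,
            PRIMARY KEY (dataset_id, position)
        );
    """)

def add_dataset(conn, name, column_names, rows, parent_id=None, parent_row=None):
    cur = conn.execute(
        "INSERT INTO dataset (name, parent_id, parent_row) VALUES (?, ?, ?)",
        (name, parent_id, parent_row),
    )
    dataset_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO dataset_column VALUES (?, ?, ?)",
        [(dataset_id, i, col) for i, col in enumerate(column_names)],
    )
    # Identifiers cannot be bound as parameters, so the data table uses
    # generated names (data_<id>, c0, c1, ...) instead of user-supplied text.
    col_defs = ", ".join(f"c{i} REAL" for i in range(len(column_names)))
    marks = ", ".join("?" for _ in column_names)
    conn.execute(
        f"CREATE TABLE data_{dataset_id} (row_idx INTEGER PRIMARY KEY, {col_defs})"
    )
    conn.executemany(
        f"INSERT INTO data_{dataset_id} VALUES (?, {marks})",
        [(i, *row) for i, row in enumerate(rows)],
    )
    return dataset_id

def children_of(conn, dataset_id, row_idx):
    """Child datasets attached to one row of a parent dataset."""
    return conn.execute(
        "SELECT id, name FROM dataset WHERE parent_id = ? AND parent_row = ? ORDER BY id",
        (dataset_id, row_idx),
    ).fetchall()

conn = sqlite3.connect(":memory:")
create_master(conn)
scan = add_dataset(conn, "scan", ["time", "voltage"], [(0.0, 1.2), (0.1, 1.4)])
add_dataset(conn, "spectrum", ["freq", "amp", "phase"], [(50.0, 0.3, 0.0)],
            parent_id=scan, parent_row=1)
print(children_of(conn, scan, 1))
```

Notice that nothing ties `parent_row` to an actual row of the parent's data table: that table's name is only known at runtime, so no foreign key can point at it. That is the referential-integrity cost mentioned above, and in this sketch every column is simply declared REAL.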

It's hard to be specific without understanding the problem domain, but if your data is inherently relational, use a relational model. If your data is not inherently relational, I wouldn't try to force it into a relational model for the sake of it: the fact that all datasets happen to have an ID doesn't mean those IDs are the same, or even that they are suitable for use as a primary key.

I'd suggest starting by putting each dataset in its own table (or tables, if there are child records), and creating a master table if you need one.

I'd echo zebediah49's question: "Are you really going to use a database for this? Wouldn't flat files be better?"

We store a bunch of data like this, each dataset in its own flat file. The header of the file contains enough information (timestamp, number of rows/cols, etc.) for the file to be read. Meta-information about the data is then kept in the database. At minimum this is the file location, but it could include other information about the data. For example, we aggregate the data into proxy variables that summarize the details at a high level. Typically this summary data is good enough, but when necessary we can read the file for all the details.
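As an illustration only, such a setup could be sketched like this in Python with the standard library. The binary layout (a 16-byte header followed by raw doubles), the `dataset_file` table and the min/max summary columns are assumptions invented for the example, not a description of our actual format.

```python
import array
import os
import sqlite3
import struct
import tempfile
import time

# Header: timestamp (double), n_rows, n_cols (unsigned ints), native byte order.
HEADER = struct.Struct("=dII")

def write_dataset(path, rows):
    flat = array.array("d", (v for row in rows for v in row))
    with open(path, "wb") as f:
        f.write(HEADER.pack(time.time(), len(rows), len(rows[0])))
        flat.tofile(f)  # values are written in the machine's native byte order

def read_dataset(path):
    with open(path, "rb") as f:
        _timestamp, n_rows, n_cols = HEADER.unpack(f.read(HEADER.size))
        flat = array.array("d")
        flat.fromfile(f, n_rows * n_cols)
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]

def register(conn, path, rows):
    # Hypothetical summary: min and max of the first column only.
    col0 = [row[0] for row in rows]
    conn.execute(
        "INSERT INTO dataset_file (path, n_rows, n_cols, col0_min, col0_max) "
        "VALUES (?, ?, ?, ?, ?)",
        (path, len(rows), len(rows[0]), min(col0), max(col0)),
    )

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dataset_file (
        id INTEGER PRIMARY KEY, path TEXT NOT NULL,
        n_rows INTEGER, n_cols INTEGER, col0_min REAL, col0_max REAL
    )
""")
rows = [[1.0, 10.0], [3.0, 30.0], [2.0, 20.0]]
path = os.path.join(tempfile.mkdtemp(), "dataset_0001.bin")
write_dataset(path, rows)
register(conn, path, rows)

# Summary questions are answered from the database alone...
print(conn.execute("SELECT col0_min, col0_max FROM dataset_file").fetchone())
# ...and the file is only opened when the full detail is needed.
stored_path = conn.execute("SELECT path FROM dataset_file WHERE id = 1").fetchone()[0]
print(read_dataset(stored_path))
```

The database stays small because it holds one record per dataset, however many rows the file itself contains.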