Correct normalization of database with optional columns

I need to create a database table that stores parametric descriptions of physiological characteristics (e.g. systolic blood pressure, triglyceride concentrations, etc.) of a hypothetical cohort of patients.

For example, supposing the user specifies a triangular distribution for systolic blood pressure (SBP), then the minimum, maximum and mode (and distribution type) would have to be stored. Alternatively, the user may specify a normal distribution, requiring storage of the mean and standard deviation.

I'm struggling with the correct way to normalize these data. Currently, I have a Cohort table and a Distributions table with a number of one-to-one relationships as follows (some fields omitted):

Cohort
    id (INT, NOT NULL, PRIMARY)
    name (TEXT, NOT NULL)
    comments (TEXT)
    systolic_blood_pressure_dist (FOREIGN KEY referencing Distributions.id)
    triglyceride_dist (FOREIGN KEY referencing Distributions.id)
    ...other physiological parameters

Distributions
    id (INT, NOT NULL, PRIMARY)
    distribution_type (TEXT)
    minimum (FLOAT)
    maximum (FLOAT)
    mean (FLOAT)
    mode (FLOAT)
    sd (FLOAT)
    ...other distribution parameters (alpha, beta, shape, scale, etc.)

(distribution_type holds a string describing the distribution: "Triangular", "Weibull", etc.)

I'm pretty sure this is not the optimum way to do this, as I'm left with loads of NULL fields in each row of Distributions.

My other thought was to have separate tables for each distribution type (one for triangular, one for Gaussian, one for uniform, etc.) and have a table in the middle with an id column (to be used as a foreign key in the Cohort table's *_dist columns), a distribution type column, and an id column to store the foreign key for the row in the appropriate distribution table.

The query would use the id stored in the Cohort column to find the distribution type and row id from the middle table, then look up the parameters in the appropriate table using that id. However, using a string to select the appropriate table, then another id to select the appropriate row, is far from a traditional JOIN and also doesn't feel like a very clean approach.

So, does anyone have any suggestions regarding how to best achieve this (in terms of normalization and/or performance)?

Many thanks, Rich

Cohort
    id (INT, NOT NULL, PRIMARY)
    name (TEXT, NOT NULL)
    comments (TEXT)

Parameters
    id (INT, NOT NULL, PRIMARY)
    name (TEXT, NOT NULL) ("systolic blood pressure", "triglyceride", ...)

CohortParameters
    id (INT, NOT NULL, PRIMARY)
    cohort_id (FOREIGN KEY referencing Cohort.id)
    parameter_id (FOREIGN KEY referencing Parameters.id)
    value (TEXT)

DistributionTypes
    id (INT, NOT NULL, PRIMARY)
    name (TEXT, NOT NULL) ("Triangular", "Weibull", ...)

Distributions
    id (INT, NOT NULL, PRIMARY)
    distribution_type_id (FOREIGN KEY referencing DistributionTypes.id)
    cohort_id (FOREIGN KEY referencing Cohort.id)
    parameter_id (FOREIGN KEY referencing Parameters.id)
    minimum (FLOAT)
    maximum (FLOAT)
    mean (FLOAT)
    mode (FLOAT)
    sd (FLOAT)
    ...other distribution parameters (alpha, beta, shape, scale, etc.)

Having separate tables for different distribution types sounds right to me. In your application logic, you'll have to special-case each distribution type anyway (I presume), as it may need different rendering in the UI, or different computations.

Your thought to have a table for each distribution type is probably what you want. That way, you have a well-defined table containing exactly the values that your distribution type needs. This will save you space, will allow you to lock down which fields are nullable and which are not, and should improve performance. If the distributions share a common set of parameters, you could arrange your tables in a supertype/subtype relationship to further normalize the schema.
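A minimal sketch of such a supertype/subtype layout, using Python's built-in sqlite3 module purely for illustration. The table names (distribution, triangular_distribution, normal_distribution), the helper add_triangular and the sample values are assumptions made for this example; only the systolic_blood_pressure_dist column comes from the question. The subtype tables reuse the supertype's id as their own primary key, so the lookup is an ordinary JOIN:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

conn.executescript("""
-- Supertype: one row per stored distribution, whatever its type.
CREATE TABLE distribution (
    id INTEGER PRIMARY KEY,
    distribution_type TEXT NOT NULL
);
-- Subtypes: the primary key is also a foreign key to the supertype.
CREATE TABLE triangular_distribution (
    id INTEGER PRIMARY KEY REFERENCES distribution(id),
    minimum REAL NOT NULL,
    maximum REAL NOT NULL,
    mode REAL NOT NULL
);
CREATE TABLE normal_distribution (
    id INTEGER PRIMARY KEY REFERENCES distribution(id),
    mean REAL NOT NULL,
    sd REAL NOT NULL
);
CREATE TABLE cohort (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    systolic_blood_pressure_dist INTEGER REFERENCES distribution(id)
);
""")


def add_triangular(conn, minimum, maximum, mode):
    """Insert the supertype row, then the subtype row sharing its id."""
    dist_id = conn.execute(
        "INSERT INTO distribution (distribution_type) VALUES ('Triangular')"
    ).lastrowid
    conn.execute(
        "INSERT INTO triangular_distribution (id, minimum, maximum, mode) "
        "VALUES (?, ?, ?, ?)",
        (dist_id, minimum, maximum, mode),
    )
    return dist_id


# Hypothetical sample data.
sbp_id = add_triangular(conn, 90.0, 180.0, 120.0)
conn.execute(
    "INSERT INTO cohort (name, systolic_blood_pressure_dist) VALUES (?, ?)",
    ("Example cohort", sbp_id),
)

row = conn.execute("""
    SELECT c.name, d.distribution_type, t.minimum, t.maximum, t.mode
    FROM cohort c
    JOIN distribution d ON d.id = c.systolic_blood_pressure_dist
    JOIN triangular_distribution t ON t.id = d.id
""").fetchone()
print(row)
```

The application still has to decide which subtype table to join, based on distribution_type, but no second id is needed because the subtype row shares the supertype's key.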

How will you use the data when you query it?

If you are querying a number of cohorts, and it's reasonable for the cohorts to have different distributions, then your result would be a "union", where indeed many columns would be null. In that case your results are in some sense "not normal", but that doesn't mean that the schema should be.

The advantage of having different tables for different distribution types is that each table would explicitly define the columns that must be populated to describe that distribution. You can then even set some columns to be "not null".
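To illustrate both points, here is a small sketch, again with Python's sqlite3 module. The table names triangular_dist and normal_dist and the sample numbers are assumptions for this example only. Each table declares its own columns NOT NULL, while the UNION ALL result is where the NULLs appear:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE triangular_dist (
    id INTEGER PRIMARY KEY,
    minimum REAL NOT NULL,
    maximum REAL NOT NULL,
    mode REAL NOT NULL
);
CREATE TABLE normal_dist (
    id INTEGER PRIMARY KEY,
    mean REAL NOT NULL,
    sd REAL NOT NULL
);
-- Hypothetical sample rows.
INSERT INTO triangular_dist (minimum, maximum, mode) VALUES (90, 180, 120);
INSERT INTO normal_dist (mean, sd) VALUES (1.5, 0.4);
""")

# The stored rows have no NULLs; the combined result pads the columns
# that do not apply to a given distribution type.
rows = conn.execute("""
    SELECT 'Triangular' AS distribution_type,
           minimum, maximum, mode, NULL AS mean, NULL AS sd
    FROM triangular_dist
    UNION ALL
    SELECT 'Normal', NULL, NULL, NULL, mean, sd
    FROM normal_dist
    ORDER BY distribution_type
""").fetchall()
for r in rows:
    print(r)

# An incomplete normal distribution is rejected by the schema itself.
try:
    conn.execute("INSERT INTO normal_dist (mean, sd) VALUES (1.5, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print("incomplete row rejected:", rejected)
```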

I like the general idea of your proposal.

Your design seems to indicate that there can only be one single type of distribution data per item of measured information. It seems impossible, in your design, to have both "even distribution" and "triangular distribution" data on, say, "systolic blood pressure".

This seems to indicate that for each individual piece of "measured information", you already know upfront, at system design time, what kind of distribution data is available.

This in turn seems to indicate that there is no need whatsoever (and from a relational point of view it is outright bad to do so) to gather these different kinds of distribution in a single collection, only to reinstate any necessary distinction by adding a superfluous "distribution type" column.

EDIT

"The distribution type column also becomes necessary as soon as there are two or more cohorts in the database with differently distributed physiological parameters."

That seems crap. Distinct cohorts hold distinct distribution measurement IDs, and distinct distribution measurement IDs can be of different distribution types by your very own design.
