SQL Schema Design Advice

I have a 'users' table holding concrete, "sure" properties of my users: every one of them must be present, and its veracity is certain. I also have a separate table, 'users_derived', in which every column is a derived property guessed by machine learning models. For example, 'age' might be a certain property because the user supplied it, whereas 'height' or 'hair color' might be a derived property because an ML model guessed it from a picture. The main difference is that all properties in the 'users' table were given to me by the users themselves and are completely certain, whereas every property in the 'users_derived' table carries both a value and an associated certainty and was guessed by my system. The other difference is that every property in the 'users' table is present for every user, while any property in the 'users_derived' table may or may not be present. From time to time I also add new ML models that guess at more user properties.

My question is how to design the schema for the 'users_derived' table. I could do it like this:

userid  |  prop1  | certainty1  |  prop2  | certainty2 | prop3 |  etc ...
123         7         0.57         5'8''       0.82       red
124         12        0.6          NULL        NULL       black
125         NULL      NULL         6'1''       0.88       blonde

Or I could do it like this, with slightly different indexing:

userid   |  property  |  value   |   certainty 
 123           1           7            0.57
 123           2          5'8''         0.82
 124           1           12           0.60
 123           3          red           0.67
 124           3          black         0.61
 125           2          6'1''         0.88
                       etc ....
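
A minimal sketch of this second layout, using Python's built-in sqlite3 module purely for illustration. The column types, the NOT NULL constraints, and the composite primary key on (userid, property) are assumptions, not something stated above; the rows are the sample rows from the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed DDL: one row per (user, property) guess.
conn.execute("""
    CREATE TABLE users_derived (
        userid    INTEGER NOT NULL,
        property  INTEGER NOT NULL,  -- numeric property id, as in the sample
        value     TEXT    NOT NULL,  -- text, because value types differ per property
        certainty REAL    NOT NULL,
        PRIMARY KEY (userid, property)
    )
""")

rows = [
    (123, 1, "7", 0.57),
    (123, 2, "5'8''", 0.82),
    (124, 1, "12", 0.60),
    (123, 3, "red", 0.67),
    (124, 3, "black", 0.61),
    (125, 2, "6'1''", 0.88),
]
conn.executemany("INSERT INTO users_derived VALUES (?, ?, ?, ?)", rows)

# A property that was never guessed simply has no row, so no NULLs appear.
props_124 = conn.execute(
    "SELECT property, value, certainty FROM users_derived "
    "WHERE userid = ? ORDER BY property",
    (124,),
).fetchall()
print(props_124)
```

User 124 has no guess for property 2, so the query returns only two rows for that user.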

So the tradeoff seems to be that the second way isn't as normalized and might be slightly harder to query, but I don't have to know all the properties I care about in advance; that is, adding a new property requires no schema change. There also don't have to be any NULL slots, since if we don't have a property yet we simply don't have a row for it. What am I missing? What are the benefits of the first way? Are there queries I can run against the first schema that are hard or impossible against the second? Does the second way somehow need more space for indexing to make it fast?
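
To make the "harder to query" point concrete, here is a hedged sketch of a filter on two properties at once against the second layout, again with Python's sqlite3 for illustration only. The table definition and the 0.58 threshold are assumptions. In the first layout the same filter would be a single WHERE clause over columns such as prop3 and certainty1; in the second it needs a self-join, one alias per property.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users_derived (
        userid INTEGER, property INTEGER, value TEXT, certainty REAL,
        PRIMARY KEY (userid, property)
    )
""")
conn.executemany(
    "INSERT INTO users_derived VALUES (?, ?, ?, ?)",
    [
        (123, 1, "7", 0.57),
        (123, 2, "5'8''", 0.82),
        (124, 1, "12", 0.60),
        (123, 3, "red", 0.67),
        (124, 3, "black", 0.61),
        (125, 2, "6'1''", 0.88),
    ],
)

# Users whose property 3 is 'black' AND whose property 1 guess
# has certainty of at least 0.58 (an arbitrary illustrative threshold).
matches = conn.execute("""
    SELECT p3.userid
    FROM users_derived AS p3
    JOIN users_derived AS p1
      ON p1.userid = p3.userid
     AND p1.property = 1
    WHERE p3.property = 3
      AND p3.value = 'black'
      AND p1.certainty >= 0.58
""").fetchall()
print(matches)
```

The query is possible in either layout; the second one is just more verbose as the number of properties in the filter grows.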

The second way is more normalized. Both the table and the indexes are likely to be more compact, especially if the first form is relatively sparsely populated. Although the two forms have different tradeoffs for different queries, in general the second form is more flexible and better suited to a wide variety of queries. If you want to transform data from the normalized form to the crosstabbed form, the crosstab function in Postgres' tablefunc extension can be used for this purpose. Normalizing crosstabbed data is more difficult, especially if the number of columns is indeterminate, yet you may need to do that for some types of queries.
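
As a rough illustration of both directions, the sketch below uses Python's sqlite3 rather than Postgres, so it does not call crosstab. Instead it pivots with conditional aggregation (MAX over CASE), a swapped-in portable technique, and un-pivots with UNION ALL. The table name users_wide and the certainty3 column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE users_derived (
        userid INTEGER, property INTEGER, value TEXT, certainty REAL,
        PRIMARY KEY (userid, property)
    )
""")
rows = [
    (123, 1, "7", 0.57),
    (123, 2, "5'8''", 0.82),
    (124, 1, "12", 0.60),
    (123, 3, "red", 0.67),
    (124, 3, "black", 0.61),
    (125, 2, "6'1''", 0.88),
]
conn.executemany("INSERT INTO users_derived VALUES (?, ?, ?, ?)", rows)

# Normalized -> crosstabbed: one CASE expression per known property.
conn.execute("""
    CREATE TABLE users_wide AS
    SELECT userid,
           MAX(CASE WHEN property = 1 THEN value END)     AS prop1,
           MAX(CASE WHEN property = 1 THEN certainty END) AS certainty1,
           MAX(CASE WHEN property = 2 THEN value END)     AS prop2,
           MAX(CASE WHEN property = 2 THEN certainty END) AS certainty2,
           MAX(CASE WHEN property = 3 THEN value END)     AS prop3,
           MAX(CASE WHEN property = 3 THEN certainty END) AS certainty3
    FROM users_derived
    GROUP BY userid
""")
wide = conn.execute("SELECT * FROM users_wide ORDER BY userid").fetchall()

# Crosstabbed -> normalized: one UNION ALL branch per column pair, so the
# query text has to change whenever a property column is added.
long_again = conn.execute("""
    SELECT userid, 1 AS property, prop1 AS value, certainty1 AS certainty
      FROM users_wide WHERE prop1 IS NOT NULL
    UNION ALL
    SELECT userid, 2, prop2, certainty2
      FROM users_wide WHERE prop2 IS NOT NULL
    UNION ALL
    SELECT userid, 3, prop3, certainty3
      FROM users_wide WHERE prop3 IS NOT NULL
    ORDER BY userid, property
""").fetchall()

print(wide)
print(long_again)
```

Both statements hard-code the property list here. The difference is that the un-pivot query has to be rewritten whenever the wide table gains a column, which is where an indeterminate number of columns becomes a problem.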
