
Storing wide-form dataframes in datajoint table

Posted 2022-07-25 16:49:16 | Tags: pandas, datajoint

Say I have some analysis that spits out a wide-form pandas dataframe with a MultiIndex on both the index and the columns. Depending on the analysis parameters, the number of columns may change. What is the best design pattern to use to store the outputs in a datajoint table? The following come to mind, each with pros and cons:

Reshape to long-form and store single entries with index x column levels as primary keys

Pros: Preserves the ability to query/constrain based on both index and columns
Cons: Each analysis would insert millions of rows into the table, and I may have to do hundreds of such analyses. Even adding this many rows seems to take several minutes per dataframe, and queries become slow
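
For illustration, the long-form reshape in this option might look like the minimal sketch below. The level names (`session`, `trial`, `unit`, `metric`) and the toy values are made up for the example, and the actual insert into a datajoint table is left out.

```python
import numpy as np
import pandas as pd

# Toy wide-form frame with a MultiIndex on both axes (all names are hypothetical).
index = pd.MultiIndex.from_product([["s1", "s2"], [0, 1]], names=["session", "trial"])
columns = pd.MultiIndex.from_product(
    [["u1", "u2"], ["rate", "width"]], names=["unit", "metric"]
)
wide = pd.DataFrame(
    np.arange(16, dtype=float).reshape(4, 4), index=index, columns=columns
)

# Move every column level into the row index, leaving one value per row.
long = wide.stack(["unit", "metric"]).rename("value").reset_index()

# One dict per (session, trial, unit, metric) combination, i.e. one table row each.
rows = long.to_dict("records")
print(len(rows))  # 4 index rows x 4 columns = 16 entries
```

The row count is the product of the number of index rows and the number of columns, which is where the millions of rows per analysis come from.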

Keep as wide-form and store single rows as longblob with just index levels as primary keys

Pros: Retains the ability to query based on index levels, and results in tables with a more reasonable number of rows
Cons: Loses the ability to query based on column levels. The columns would then also have to be stored somewhere to be able to reconstruct the original dataframes. Since dataframes with different numbers of columns need to be stored in the same table, it is not feasible to explicitly encode all the columns in the table definition
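
A rough sketch of what this option implies, using plain pandas and NumPy only: each index row becomes one record whose values are a single array (the part that would go into a longblob), while the column labels are kept once per analysis so the frame can be rebuilt. The level names and the `values` field name are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy wide-form frame with a MultiIndex on both axes (all names are hypothetical).
index = pd.MultiIndex.from_product([["s1", "s2"], [0, 1]], names=["session", "trial"])
columns = pd.MultiIndex.from_product(
    [["u1", "u2"], ["rate", "width"]], names=["unit", "metric"]
)
wide = pd.DataFrame(
    np.arange(16, dtype=float).reshape(4, 4), index=index, columns=columns
)

# One record per index row; the whole row of values is a single array (blob-like).
rows = [
    {"session": s, "trial": t, "values": wide.loc[(s, t)].to_numpy()}
    for s, t in wide.index
]

# Column labels must be kept separately, once per analysis, to rebuild the frame.
column_labels = wide.columns.to_frame(index=False)

# Reconstruct the original frame from the stored records and column labels.
rebuilt = pd.DataFrame(
    np.vstack([r["values"] for r in rows]),
    index=pd.MultiIndex.from_tuples(
        [(r["session"], r["trial"]) for r in rows], names=wide.index.names
    ),
    columns=pd.MultiIndex.from_frame(column_labels),
)
```

Constraining on `session` or `trial` stays cheap here, but selecting a single `unit` means fetching and unpacking every array first.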

Store the dataframe itself as e.g. an h5 file and store it in the database simply as a filepath or as an attachment

Pros: Does not result in large databases, simple to implement
Cons: Does not really feel in the "spirit" of datajoint, and loses the ability to perform constraints or queries

Are there any designs or pros/cons I haven't thought of?

1 answer

Before providing a more specific answer, let's establish a few basics (also known as normal forms).

DataJoint implements the relational data model. Under the relational model, complex dataframes of the type you described require normalization into multiple tables related to each other through their primary keys and foreign keys.

Each table will represent a single entity class: Units and Trials will be represented in separate tables.

All entities in a given table will have the same attributes (columns). They will be uniquely identified by the same attribute(s) comprising the primary key.

In addition to the primary key, tables may have additional secondary indexes to accelerate queries.
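
To make these basics concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in. It is not DataJoint syntax, and the table and column names (`unit`, `trial`, `unit_trial_value`, `value`) are assumptions for illustration only. It shows one table per entity class, a composite primary key, foreign keys, and a secondary index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

conn.executescript("""
-- One table per entity class.
CREATE TABLE unit  (unit_id  INTEGER PRIMARY KEY);
CREATE TABLE trial (trial_id INTEGER PRIMARY KEY);

-- One value per (unit, trial) pair; the composite primary key identifies each row,
-- and the foreign keys tie it back to the parent tables.
CREATE TABLE unit_trial_value (
    unit_id  INTEGER NOT NULL REFERENCES unit(unit_id),
    trial_id INTEGER NOT NULL REFERENCES trial(trial_id),
    value    REAL NOT NULL,
    PRIMARY KEY (unit_id, trial_id)
);

-- Secondary index to speed up queries that constrain on trial alone.
CREATE INDEX idx_value_trial ON unit_trial_value (trial_id);
""")

conn.executemany("INSERT INTO unit VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO trial VALUES (?)", [(10,), (11,)])
conn.executemany(
    "INSERT INTO unit_trial_value VALUES (?, ?, ?)",
    [(1, 10, 0.5), (1, 11, 0.7), (2, 10, 1.5), (2, 11, 1.7)],
)

# Constrain on what used to be a dataframe column level (the unit).
unit_1 = conn.execute(
    "SELECT trial_id, value FROM unit_trial_value WHERE unit_id = ? ORDER BY trial_id",
    (1,),
).fetchall()

# A value that refers to a unit that does not exist is rejected by the foreign key.
try:
    conn.execute("INSERT INTO unit_trial_value VALUES (?, ?, ?)", (3, 10, 9.9))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Because both former axes of the dataframe are now key attributes, either one can be used to constrain a query, and a different number of units per analysis changes the number of rows rather than the table definition.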

If you already know about normalization, we can talk about how to normalize your design. If not, we can refer you to a quick tutorial.