
Large amount of timecourses in database

I have a rather large amount of data (~400 million datapoints), organized as a set of ~100,000 timecourses. This data may change every day and, for reasons of revision safety, has to be archived daily.

Obviously this is far too much data to handle efficiently, so I analyzed some sample data. Approximately 60 to 80% of the courses do not change at all between two days, and for the rest only a very limited number of elements change. All in all, I expect far fewer than 10 million datapoints to change.
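The kind of analysis described above can be sketched as follows. This is only a minimal illustration, assuming two daily snapshots fit in memory as dictionaries mapping a course key to its tuple of values; the names `Snapshot`, `change_stats`, and the sample courses are hypothetical and not part of the original setup.

```python
from typing import Dict, Tuple

# Hypothetical in-memory shape: course key -> tuple of hourly values.
Snapshot = Dict[str, Tuple[float, ...]]


def change_stats(old: Snapshot, new: Snapshot) -> Tuple[float, int]:
    """Return (fraction of courses unchanged, number of changed datapoints)."""
    keys = old.keys() | new.keys()
    unchanged = 0
    changed_points = 0
    for key in keys:
        a = old.get(key, ())
        b = new.get(key, ())
        if a == b:
            unchanged += 1
            continue
        # Count positions that differ, plus any difference in length.
        changed_points += sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))
    fraction = unchanged / len(keys) if keys else 1.0
    return fraction, changed_points


# Tiny made-up sample: one course is identical, one differs in a single value.
yesterday = {"Course_A": (5.0, 5.0, 7.5), "Course_B": (1.0, 2.0, 3.0)}
today = {"Course_A": (5.0, 5.0, 7.5), "Course_B": (1.0, 2.5, 3.0)}
fraction_unchanged, changed = change_stats(yesterday, today)
```

On real data the comparison would more likely run inside the database, but the two numbers computed here correspond to the two figures quoted above: the share of unchanged courses and the count of changed datapoints.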

The question is: how do I make use of this knowledge? I am aware of concepts like the delta trees used by SVN and similar techniques, but I would prefer the database itself to handle such semantic compression. We are using Oracle 11g for storage. Is there a better way than a homebrew solution?

Clarification

I am talking about timecourses representing hourly energy currents. Such a timecourse might start in the past (e.g. 2005), contains 8760 elements per year, and might end at any time up to 2020 (currently). Each timecourse is identified by a unique string.

The courses themselves are more or less boring: "Course_XXX: 1.1.2005 0:00 5; 1.1.2005 1:00 5; 1.1.2005 2:00 7,5; ..."

My task is to make day-to-day changes in these courses visible. To do so, a snapshot has to be taken each day at a given time. My hope is that some lossless semantic compression will spare me from archiving ~20 GB per day.

Basically, my source data looks like this:

Key | Value0 | ... | Value23

To archive that data, I need to add an additional dimension that directly or indirectly tells me the time at which the data was loaded from the source system, so my archive database is

Key | LoadID | Value0 | ... | Value23

where LoadID is more or less the time at which the source DB was accessed.

Now, compression in my scenario is easy. LoadIDs grow with each run, so I can store a range, i.e.

Key | LoadID1 | LoadID2 | Value0 | ... | Value23

where LoadID1 is the ID of the first load in which the 24 values were observed, and LoadID2 is the ID of the last consecutive load in which the same 24 values were observed.
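The range scheme can be sketched roughly like this. It is a minimal in-memory illustration, not the actual Oracle implementation, and it assumes that LoadIDs are consecutive integers and that each key identifies one row of values. `ArchiveRow`, `archive_load`, and the sample key are hypothetical names, and the value tuples are shortened from 24 to 3 entries for readability.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Values = Tuple[float, ...]


@dataclass
class ArchiveRow:
    key: str
    load_id_first: int  # LoadID1: first load in which these values were seen
    load_id_last: int   # LoadID2: last consecutive load with the same values
    values: Values


def archive_load(
    archive: List[ArchiveRow],
    open_rows: Dict[str, ArchiveRow],
    load_id: int,
    snapshot: Dict[str, Values],
) -> None:
    """Merge one load into the archive, extending ranges where nothing changed."""
    for key, values in snapshot.items():
        row = open_rows.get(key)
        # Extend only if the values are identical AND the previous load is
        # the one directly before this one (assumes consecutive integer IDs).
        if row is not None and row.values == values and row.load_id_last == load_id - 1:
            row.load_id_last = load_id
        else:
            row = ArchiveRow(key, load_id, load_id, values)
            archive.append(row)
            open_rows[key] = row


archive: List[ArchiveRow] = []
open_rows: Dict[str, ArchiveRow] = {}
key = "Course_XXX|2005-01-01"  # made-up key format
archive_load(archive, open_rows, 1, {key: (5.0, 5.0, 7.5)})
archive_load(archive, open_rows, 2, {key: (5.0, 5.0, 7.5)})  # unchanged: range grows
archive_load(archive, open_rows, 3, {key: (5.0, 6.0, 7.5)})  # changed: new row
```

After the three loads, the archive holds two rows instead of three: one covering loads 1 to 2 and one for load 3. The check on `load_id - 1` keeps the "consecutive" guarantee, so a key that disappears for one load and then returns with the same values starts a new range.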

In my scenario, this reduces the amount of data stored in the database to 1/30th.
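Reading a single day's snapshot back out of the range-compressed form stays simple, since a row belongs to a load exactly when that LoadID falls within its range. The following is a small sketch under the assumption that rows are available as `(key, LoadID1, LoadID2, values)` tuples; `RangeRow`, `snapshot_at`, and the sample data are hypothetical.

```python
from typing import Dict, List, Tuple

# (key, LoadID1, LoadID2, values); values shortened to 3 entries for readability.
RangeRow = Tuple[str, int, int, Tuple[float, ...]]


def snapshot_at(rows: List[RangeRow], load_id: int) -> Dict[str, Tuple[float, ...]]:
    """Return the values each key had at the given load."""
    return {
        key: values
        for key, first, last, values in rows
        if first <= load_id <= last
    }


rows: List[RangeRow] = [
    ("Course_A", 1, 2, (5.0, 5.0, 7.5)),
    ("Course_A", 3, 3, (5.0, 6.0, 7.5)),
    ("Course_B", 1, 3, (1.0, 2.0, 3.0)),
]
day2 = snapshot_at(rows, 2)
day3 = snapshot_at(rows, 3)
```

In SQL the same filter would be a `BETWEEN` condition on the two LoadID columns. Comparing `day2` and `day3` then gives the day-to-day changes that the snapshots are meant to make visible.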
