What is the best way to store historical data in SQL Server 2005/2008?

My simplified and contrived example is the following:

Let's say that I want to measure and store the temperature (and other values) of all the world's towns on a daily basis. I am looking for an optimal way of storing the data so that it is just as easy to get the current temperature in all the towns as it is to get all the historical temperatures for one town.

It is an easy enough problem to solve, but I am looking for the best solution.

The 2 main options I can think of are as follows:

Option 1 - Same table stores current and historical records

Store all the current and archive records in the same table.

i.e.

CREATE TABLE [dbo].[WeatherMeasurement](
  MeasurementID [int] IDENTITY(1,1) NOT NULL,
  TownID [int] NOT NULL,
  Temp [int] NOT NULL,
  [Date] [datetime] NOT NULL
)

This would keep everything simple, but what would be the most efficient query to get a list of towns and their current temperature? Would this scale once the table has millions of rows in it? Is there anything to be gained by having some sort of IsCurrent flag in the table?
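For illustration, one common way to write both queries against the Option 1 table uses ROW_NUMBER(), which is available from SQL Server 2005 onward. This is a sketch rather than a benchmarked answer; @TownID and the value 42 are placeholders.

```sql
-- Current (latest) measurement for every town in the single-table design.
WITH Ranked AS (
    SELECT TownID, Temp, [Date],
           ROW_NUMBER() OVER (PARTITION BY TownID ORDER BY [Date] DESC) AS rn
    FROM dbo.WeatherMeasurement
)
SELECT TownID, Temp, [Date]
FROM Ranked
WHERE rn = 1;

-- Full history for a single town (42 is a placeholder TownID).
DECLARE @TownID int;
SET @TownID = 42;

SELECT [Date], Temp
FROM dbo.WeatherMeasurement
WHERE TownID = @TownID
ORDER BY [Date];
```

How well either query scales depends on the indexes on the table, so it is worth checking the execution plans against realistic data volumes.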

Option 2 - Store all archive records in a separate table

There would be a table to store the current live measurements in

CREATE TABLE [dbo].[WeatherMeasurement](
  MeasurementID [int] IDENTITY(1,1) NOT NULL,
  TownID [int] NOT NULL,
  Temp [int] NOT NULL,
  [Date] [datetime] NOT NULL
)

And a table to store historical archived data (inserted by a trigger perhaps)

CREATE TABLE [dbo].[WeatherMeasurementHistory](
  MeasurementID [int] IDENTITY(1,1) NOT NULL,
  TownID [int] NOT NULL,
  Temp [int] NOT NULL,
  [Date] [datetime] NOT NULL
)

This has the advantage of keeping the main current data lean and very efficient to query, at the expense of making the schema more complex and inserting data more expensive.
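A minimal sketch of the trigger idea, assuming the live table holds one row per town that is overwritten (or deleted) when a new measurement arrives. The trigger name is hypothetical, and the history table generates its own MeasurementID because of its IDENTITY column.

```sql
-- Copies the previous version of each changed or removed row into the history table.
CREATE TRIGGER dbo.trg_WeatherMeasurement_Archive
ON dbo.WeatherMeasurement
AFTER UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- "deleted" holds the pre-change rows for both UPDATE and DELETE statements.
    INSERT INTO dbo.WeatherMeasurementHistory (TownID, Temp, [Date])
    SELECT d.TownID, d.Temp, d.[Date]
    FROM deleted AS d;
END;
```

The extra insert performed inside the trigger is the "inserting data more expensive" cost mentioned above.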

Which is the best option? Are there better options I haven't mentioned?

NOTE: I have simplified the schema to help focus my question better, but assume there will be a lot of data inserted each day (100,000s of records), and data is current for one day. The current data is just as likely to be queried as the historical.

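It depends on the application's usage patterns. If the usage patterns indicate that the historical data will be queried more often than the current values, then put them all in one table. But if historical queries are the exception (or less than 10% of the queries), and the performance of the more common current-value query would suffer from putting all the data in one table, then it makes sense to separate that data out into its own table.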

I would keep the data in one table unless you have a very serious bias for current data (in usage) or history data (in volume). A compound index with DATE + TOWNID (in that order) would remove the performance concern in most cases (although clearly we don't have the data to be sure of this at this time).
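As a sketch of that compound index on the question's table: the index name is made up, and the INCLUDE (Temp) clause is an addition that makes the index cover the sample query below, so it can be dropped if covering is not wanted.

```sql
-- Compound index on Date + TownID, in that order, carrying Temp as an included column.
CREATE NONCLUSTERED INDEX IX_WeatherMeasurement_Date_TownID
ON dbo.WeatherMeasurement ([Date], TownID)
INCLUDE (Temp);

-- Today's measurements for all towns: a range seek on the leading [Date] column.
DECLARE @Today datetime;
SET @Today = DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0);  -- midnight today

SELECT TownID, Temp, [Date]
FROM dbo.WeatherMeasurement
WHERE [Date] >= @Today
  AND [Date] <  DATEADD(day, 1, @Today);
```

Whether a second index led by TownID is also worth having for the one-town history query is something to measure rather than assume.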

The one thing I would wonder about is whether anyone will want both the current and the history data for a town. If so, you have just created at least one new view to worry about, and a possible performance problem in that direction.

This is unfortunately one of those things where you may need to profile your solutions against real-world data. I personally have used compound indexes such as the one specified above in many cases, and yet there are a few edge cases where I have opted to break the history into another table. Well, actually another data file, because the problem was that the history was so dense that I created a new data file for it alone to avoid bloating the entire primary data file set. Performance issues are rarely solved by theory.

I would recommend reading up on query hints for index use, and "covering indexes" for more information about performance issues.

Your table is very narrow and would probably perform well as a single, properly indexed table, which would never outstrip the capacity of SQL Server in a traditional normalized OLTP model, even for millions and millions of rows. Even the advantages of the dual-table model can be obtained by using table partitioning in SQL Server, so it doesn't have much to recommend it over the single-table model. This would be an Inmon-style or "Enterprise Data Warehouse" scenario.
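A rough sketch of what date-based table partitioning could look like for this table. The object names and boundary dates are invented for illustration, and in SQL Server 2005/2008 table partitioning is limited to the Enterprise and Developer editions.

```sql
-- One partition per year; the boundary values are placeholders.
CREATE PARTITION FUNCTION pfMeasurementYear (datetime)
AS RANGE RIGHT FOR VALUES ('20080101', '20090101', '20100101');

-- Map every partition to the PRIMARY filegroup for simplicity.
CREATE PARTITION SCHEME psMeasurementYear
AS PARTITION pfMeasurementYear ALL TO ([PRIMARY]);

-- Same columns as the question's table, stored on the partition scheme.
-- The partitioning column must be part of the clustered primary key.
CREATE TABLE dbo.WeatherMeasurementPartitioned (
    MeasurementID int IDENTITY(1,1) NOT NULL,
    TownID int NOT NULL,
    Temp int NOT NULL,
    [Date] datetime NOT NULL,
    CONSTRAINT PK_WeatherMeasurementPartitioned
        PRIMARY KEY CLUSTERED ([Date], MeasurementID)
) ON psMeasurementYear ([Date]);
```

Queries that filter on [Date] can then be limited to the relevant partitions, which gives much of the "lean current data" benefit without a second table.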

In much bigger scenarios, I would transfer the data to a data warehouse (modeled with a Kimball-style dimensional model) on a regular basis and simply purge the live data - in some simple scenarios like yours, there might effectively be NO live data - it all goes straight into the warehouse. The dimensional model has a lot of advantages when slicing data different ways and storing huge numbers of facts with a variety of dimensions. Even in the data warehouse scenario, fact tables are often partitioned by date.

It might not seem like your data has this (Town and Date are your only explicit dimensions). However, in most data warehouses, dimensions can snowflake or there can be redundancy, so there would be other dimensions about the fact stored at time of load, instead of snowflaking, for more efficiency - like State, Zip Code, WasItRaining, IsStationUrban (contrived).

This might seem silly, but when you start to mine the data for results in data warehouses, this makes asking questions like "on a day with rain in urban environments, what was the average temperature in Maine?" just that little bit easier to get at without joining a whole bunch of tables (i.e. it doesn't require a lot of expertise on your normalized model and performs very quickly). Kind of like useless stats in baseball - but some apparently turn out to be useful.
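To make the Maine question concrete, here is a sketch against a hypothetical fact table, dbo.FactWeatherMeasurement, that stores State, WasItRaining and IsStationUrban alongside each measurement at load time. The table name, column types and the 'Maine' value are all assumptions.

```sql
-- Average temperature in Maine on rainy days at urban stations, with no joins.
SELECT AVG(CAST(Temp AS decimal(9, 2))) AS AvgTemp
FROM dbo.FactWeatherMeasurement
WHERE State = 'Maine'
  AND WasItRaining = 1
  AND IsStationUrban = 1;
```

The CAST avoids the integer result that AVG returns for an int column.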

I suggest keeping it in the same table, since historical data is queried just as often, unless you will be adding many more columns to the table.

When size becomes an issue, you can partition it out by decade and have a stored procedure union the requested rows.

Another alternative could be to go for one table for all data and have a view for the current temperature. This will not help performance but could well aid readability/maintainability. You could even go for an indexed view to improve performance if you have the appropriate version of SQL Server.
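A minimal sketch of such a view, assuming "current" means "measured today" as the question's note suggests. The view name is hypothetical.

```sql
-- Plain (non-indexed) view exposing only today's measurements.
CREATE VIEW dbo.CurrentTemperature
AS
SELECT TownID, Temp, [Date]
FROM dbo.WeatherMeasurement
WHERE [Date] >= DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0);
```

Callers can then run SELECT TownID, Temp FROM dbo.CurrentTemperature without knowing how "current" is defined. Note that this particular definition could not be turned into an indexed view as written, because indexed views do not allow nondeterministic functions such as GETDATE(); an indexed variant would need a different way of marking current rows.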

I would use a single table with indexed views to provide me with the latest information. SQL Server 2005 and 2008 are designed for data warehousing, so they should perform well under this condition.

If you have a data pattern that requires writing to the db often, then the best choice would be to have an active table and an archive table that you batch update at some interval.
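One way such a batch move might be written on SQL Server 2005 or later is a single DELETE with an OUTPUT ... INTO clause, sketched here against the question's two tables. The midnight cutoff is an assumption, and the sketch assumes no archiving trigger is defined on the live table, otherwise rows would be archived twice.

```sql
-- Move everything older than today from the active table to the archive table.
DECLARE @Cutoff datetime;
SET @Cutoff = DATEADD(day, DATEDIFF(day, 0, GETDATE()), 0);  -- midnight today

DELETE FROM dbo.WeatherMeasurement
OUTPUT deleted.TownID, deleted.Temp, deleted.[Date]
INTO dbo.WeatherMeasurementHistory (TownID, Temp, [Date])
WHERE [Date] < @Cutoff;
```

Because the delete and the insert happen in one statement, a row cannot be removed from the active table without also landing in the archive. The target of OUTPUT ... INTO must not have enabled triggers or take part in foreign key constraints.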

If you store everything in one table, how are you going to make a relational database?

Example:

id--------------GUID----PK

record_id-------GUID

Every time a new record is inserted, the [id] will change but the [record_id] will remain the same. Now, if you have to link it with an address table, how are you going to do that?

Instead of trying to optimize relational databases for this, you might want to consider using a time series database. These are already optimized for dealing with time-based data. Some of their advantages are:

  • Faster at querying time-based keys
  • Large data throughput
    • Since the default operation is just an append, this can be done very quickly. (InfluxDB supports millions of data points per second.)
  • Able to compress data more aggressively
  • More user-friendly for time-series data.
    • The APIs tend to reflect typical use-cases for time-series data
    • Aggregate metrics can be automatically calculated (e.g. windowed averages)
    • Specific visualization tools are often available.

Personally I liked using the open source database InfluxDB, but other good alternatives are available.
