Time-series and correlation strategies

I have various time-series I'd like to correlate and present as either a CSV file or an in-memory DataTable (.NET). These time-series are arrays of time-value pairs (actually the objects contain more than just a time and a value). The time-series may span different, overlapping periods, and some may have holes (missing values for given timestamps).

For those interested, I'm using the OPC HDA .NET library to extract historic time-series from an OPC HDA server.

The resulting DataTable should have one column per time-series, with all rows in chronological order based on a timestamp column. See the example below:

| TIME | TS1 | TS2 | TS3 | TS4 |
|------|-----|-----|-----|-----|
|  1   |  X  |     |  X  |  X  |
|  2   |  X  |  X  |  X  |  X  |
|  3   |  X  |  X  |     |  X  |
|  4   |  X  |  X  |  X  |     |
|  5   |  X  |  X  |  X  |     |

What would be the most effective way of achieving this? By "effective" I mean with the least amount of code, but considering that the time-series could become quite large, memory usage might also be an issue.

You can first scan all the series for the distinct timestamps (for example, aggregating them in a HashSet), then dump them into a sorted array of dates, storing the mapping between each date and its index position in a dictionary:

// Collect every distinct timestamp across all series, in chronological order.
var distinctDates = allSeries
  .SelectMany(s => s.Values.Select(v => v.Date))
  .Distinct()
  .OrderBy(d => d)
  .ToArray();

// Map each timestamp to its row index in the final layout.
var datePositions = distinctDates
  .Select((d, index) => new
    {
      Date = d,
      Index = index
    })
  .ToDictionary(x => x.Date, x => x.Index);

Then create a jagged array with a width of "NumberOfSeries" and a length of "NumberOfDates". After that, do a second pass over all the data and dump each value into its position:

var values = new float[allSeries.Length][];
for (var i = 0; i < allSeries.Length; i++)
{
  values[i] = new float[distinctDates.Length];
  var currentSerie = allSeries[i];
  foreach (var value in currentSerie.Values)
  {
    // Drop each value into the row matching its timestamp;
    // rows with no value for this series keep the float default of 0.
    var index = datePositions[value.Date];
    values[i][index] = value.Value;
  }
}

I wrote this code without touching Visual Studio, so there may be a few typos, or it may use a few LINQ methods that are not in the BCL (just look in Lokad.Shared.dll for those). But you should be able to get the idea.
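Since the question asks for a CSV file or an in-memory DataTable, here is a minimal sketch (my addition, not part of the original answer) of dumping the jagged array above into a DataTable with the layout shown in the question. It assumes the distinctDates, allSeries and values variables from the code above; the "TS1", "TS2", … column names are illustrative, and missing values simply stay at the float default of 0:

using System.Data;

// Build a DataTable with one timestamp column plus one column per series.
var table = new DataTable("TimeSeries");
table.Columns.Add("TIME", typeof(DateTime));
for (var i = 0; i < allSeries.Length; i++)
    table.Columns.Add("TS" + (i + 1), typeof(float));

// Add one row per distinct timestamp, in chronological order.
for (var row = 0; row < distinctDates.Length; row++)
{
    var dataRow = table.NewRow();
    dataRow["TIME"] = distinctDates[row];
    for (var col = 0; col < allSeries.Length; col++)
        dataRow["TS" + (col + 1)] = values[col][row];
    table.Rows.Add(dataRow);
}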

Some more notes while I'm on the topic:

  1. Go for the jagged array if you have to keep everything in memory at once. It is far more efficient than a dictionary and has far fewer memory problems than a rectangular array.

  2. Keep the Value objects as small as possible (i.e. float instead of double).

  3. If the number of time-series values is expected to grow large in the future, never store the values in a database as "one row per value". It is better to either go for something like HDF (which has a .NET interface) or persist time-series fragments in binary form in the DB (as time-series databases do); a sketch of the binary approach follows below.

Sticking to these should allow you to scale up to hundreds of millions of time values without many problems (been there, done that).
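To illustrate the last note, here is a minimal sketch (my own illustration, not the author's code) of packing a time-series fragment into a single binary blob that can be stored as one DB row or an HDF dataset instead of one row per value:

using System;
using System.IO;

// Serialize a fragment of (timestamp, value) pairs into one compact byte array.
static byte[] SerializeFragment(DateTime[] dates, float[] values)
{
    using (var stream = new MemoryStream())
    using (var writer = new BinaryWriter(stream))
    {
        writer.Write(dates.Length);
        for (var i = 0; i < dates.Length; i++)
        {
            writer.Write(dates[i].ToBinary()); // 8 bytes per timestamp
            writer.Write(values[i]);           // 4 bytes per float value
        }
        writer.Flush();
        return stream.ToArray();
    }
}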

You might go with a data structure like a nested dictionary and iterate over the contents:

Dictionary<TimeSeries, Dictionary<DateTime, Value>> dict = new Dictionary<TimeSeries, Dictionary<DateTime, Value>>();

foreach (TimeSeries series in dict.Keys) {
    //table row output code goes here
    Dictionary<DateTime, Value> innerDict = dict[series];
    foreach (DateTime date in innerDict.Keys) {
        Value seriesValueAtTimeT = innerDict[date];
        //table column output code goes here
    }
}

Your output code would write out to something else depending on your needs, and you would replace the data types TimeSeries, Value, etc. with your actual data types.
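As a hedged sketch of what that output code might look like when the target is a CSV file (string and float stand in here for the TimeSeries and Value types, and missing (timestamp, series) combinations become empty cells):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Write one CSV row per timestamp and one column per series.
static void WriteCsv(Dictionary<string, Dictionary<DateTime, float>> dict, string path)
{
    var allDates = dict.Values
        .SelectMany(inner => inner.Keys)
        .Distinct()
        .OrderBy(d => d)
        .ToList();
    var seriesNames = dict.Keys.ToList();

    using (var writer = new StreamWriter(path))
    {
        writer.WriteLine("TIME," + string.Join(",", seriesNames));
        foreach (var date in allDates)
        {
            var cells = seriesNames.Select(name =>
                dict[name].TryGetValue(date, out var value) ? value.ToString() : string.Empty);
            writer.WriteLine(date.ToString("s") + "," + string.Join(",", cells));
        }
    }
}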

Hey Chris. I realize you've already accepted an answer, but I thought I'd post a solution I use. Hopefully it's of some use to someone; if not, at least it gives me a place to find it sometime in the future. :-)

This is VBA code fired off directly from an Excel 2007 code module. It could easily be converted to .NET.

The key to the data manipulation is the PivotTable object. I've found it to be massively efficient at getting the data into the layout you've specified.

Sub GetIndexData()
Dim cn as ADODB.Connection, cmd As ADODB.Command, rs As ADODB.Recordset
Dim rPivotTopLeft As Range, rPivotBottomRight As Range

Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual

'Get the data.'
Set cn = New ADODB.Connection
With cn
  .Provider = "SQLOLEDB"
  .ConnectionString = "Database=" & mDBName & ";" & _
                      "Server=" & mDBServerName & ";" & _
                      "UID=" & mDBUserID & ";" & _
                      "Password=" & mDBPassword & ";" & _
                      "Persist Security Info=True;"
  .CursorLocation = adUseClient
  .Open
End With

Set cmd = New ADODB.Command
Set rs = New ADODB.Recordset
With cmd
  Set .ActiveConnection = cn   'Use the connection opened above.'
  .CommandType = adCmdText
  .CommandText = "SELECT YourData From YourSource WHERE YourCritera"
  Set rs = .Execute
End With



If Not (rs.BOF And rs.EOF) Then 'Check that we have some data.'

'Put the data into a worksheet.'
With wsRawData
  .Cells.CurrentRegion.Clear

  Set rPivotTopLeft = .Range("A1")
  With ThisWorkbook.PivotCaches.Add(SourceType:=xlExternal)
    Set .Recordset = rs
    .CreatePivotTable _
        TableDestination:=rPivotTopLeft, _
        TableName:="MyPivotTable"
  End With

  'Massage the data into the desired layout.'
  With .PivotTables("MyPivotTable")
    .ManualUpdate = True

    .PivotFields("Date").Orientation = xlRowField
    .PivotFields("Index").Orientation = xlColumnField
    .AddDataField .PivotFields("Return"), "Returns", xlSum

    .DisplayFieldCaptions = False
    .ColumnGrand = False
    .RowGrand = False

    .ManualUpdate = False
  End With

  mMonthCount = Range(.Range("A3"), .Cells(Rows.Count, "A").End(xlUp)).Count
  mIndexCount = Range(.Range("B2"), .Cells(2, Columns.Count).End(xlToLeft)).Count

  'Convert pivot table to values.'
  Set rPivotBottomRight = .Cells(mMonthCount + 2, mIndexCount + 1)
  With .Range(rPivotTopLeft, rPivotBottomRight)
    .Copy
    .PasteSpecial Paste:=xlPasteValues, Operation:=xlNone, SkipBlanks:=False, Transpose:=False
  End With

  'Format the worksheet.'
  .Range("A3").Resize(mMonthCount, 1).NumberFormat = "mmm-yy"
  .Range("B3").Resize(mMonthCount, mIndexCount).NumberFormat = "0.00%"
  Union(.Rows(2), .Columns(1)).Font.Bold = True
  .Cells.ColumnWidth = 7.14
  .Rows(1).Delete

End With


rs.Close
Set rs = Nothing
Set cmd.ActiveConnection = Nothing
Set cmd = Nothing
cn.Close
Set cn = Nothing

'Restore the application settings changed at the top.'
Application.Calculation = xlCalculationAutomatic
Application.ScreenUpdating = True

End Sub

From there it's relatively easy to leverage the built-in Excel regression statistics to output a correlation matrix. With this technique I'm producing a worksheet with a 600x600 matrix of correlations in about 45 seconds.

Note that the .PivotFields parameters should be changed to match the column names of the data coming from your data source.
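If you stay in .NET rather than Excel, the same correlation matrix can be computed directly from the aligned jagged array built in the earlier answer; a minimal Pearson-correlation sketch (my addition, not part of this answer):

using System;

// Pearson correlation between two aligned series of equal length.
static double Pearson(float[] x, float[] y)
{
    int n = x.Length;
    double meanX = 0, meanY = 0;
    for (int i = 0; i < n; i++) { meanX += x[i]; meanY += y[i]; }
    meanX /= n;
    meanY /= n;

    double cov = 0, varX = 0, varY = 0;
    for (int i = 0; i < n; i++)
    {
        double dx = x[i] - meanX;
        double dy = y[i] - meanY;
        cov += dx * dy;
        varX += dx * dx;
        varY += dy * dy;
    }
    return cov / Math.Sqrt(varX * varY);
}

// Correlation matrix over the jagged array of aligned series.
static double[,] CorrelationMatrix(float[][] values)
{
    var m = new double[values.Length, values.Length];
    for (int i = 0; i < values.Length; i++)
        for (int j = 0; j < values.Length; j++)
            m[i, j] = Pearson(values[i], values[j]);
    return m;
}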

One thing you can do if memory usage becomes a problem is to move from tracking single events to tabulated counts of events within a given time period. You lose some precision about exactly when things happened, but summarizing your data like this often reduces the complexity of the picture and makes trends more obvious.

In case it's not obvious, I mean taking data that looks like this:

12:00 event1
12:01 event2
12:10 event1
12:11 event1

to this:

12:00-12:15 event1 3
12:00-12:15 event2 1
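A hedged sketch of that summarization in .NET terms (the 15-minute bucket size and the (timestamp, event type) record shape are my assumptions):

using System;
using System.Collections.Generic;
using System.Linq;

// Group raw (timestamp, eventType) records into fixed-size buckets
// and count occurrences of each event type per bucket.
static IEnumerable<(DateTime Bucket, string EventType, int Count)> Tabulate(
    IEnumerable<(DateTime Time, string EventType)> events, TimeSpan bucketSize)
{
    return events
        .GroupBy(e => new
        {
            Bucket = new DateTime(e.Time.Ticks - e.Time.Ticks % bucketSize.Ticks),
            e.EventType
        })
        .OrderBy(g => g.Key.Bucket)
        .Select(g => (g.Key.Bucket, g.Key.EventType, g.Count()));
}

Calling Tabulate(events, TimeSpan.FromMinutes(15)) on the raw data above would produce the counts shown.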
