Finding repetitions in a large data set

Question

I have a data set with data about failures in a control system. The data have the following structure:

TYPE OF FAILURE (string), START DATE (dd/mm/yyyy), START TIME (hh/mm/ss), DURATION (ss), LOCALIZATION (string), WORKING TEAM (A,B,C), SHIFT (morning, afternoon, night)

The table has 555000 rows. First, I would like to analyze whether there are repetitive failure sequences with respect to the START DATE parameter. Basically, I would like to find something like this:

Failure 1 emerged on March 10. Failure 2 emerged on March 15. There are 5 days between them. Then the pair emerged again on April 10 and April 15, also 5 days apart, and again on May 10 and May 15, also 5 days apart. Failure 1 could also emerge on other dates, but what interests me is that there is a higher probability that Failure 2 will emerge 5 days after Failure 1, and that these F1->F2 pairs recur one month apart.

I don't know if my explanation is clear enough. I am looking for suitable methods or algorithms for extracting such sequences from the data described above. Can you please point me to some methods? Or let's simply brainstorm together :). Any help is appreciated.

PS: I plan to implement this in C# or MATLAB (depending on which method is suitable). Thanks.

Answer 1

Your file looks like a big CSV; for that, MATLAB has a good implementation in the datastore:

https://es.mathworks.com/help/matlab/import_export/what-is-a-datastore.html

It also has these tools for working with large files:

https://es.mathworks.com/help/matlab/large-files-and-big-data.html

Also take a look at working with tables in MATLAB.

In your case you can do something like this:

The sample file airlinesmall.csv (123524 lines):

Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
1987,10,21,3,642,630,735,727,PS,1503,NA,53,57,NA,8,12,LAX,SJC,308,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,26,1,1021,1020,1124,1116,PS,1550,NA,63,56,NA,8,1,SJC,BUR,296,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,23,5,2055,2035,2218,2157,PS,1589,NA,83,82,NA,21,20,SAN,SMF,480,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,23,5,1332,1320,1431,1418,PS,1655,NA,59,58,NA,13,12,BUR,SJC,296,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,22,4,629,630,746,742,PS,1702,NA,77,72,NA,4,-1,SMF,LAX,373,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,28,3,1446,1343,1547,1448,PS,1729,NA,61,65,NA,59,63,LAX,SJC,308,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,8,4,928,930,1052,1049,PS,1763,NA,84,79,NA,3,-2,SAN,SFO,447,NA,NA,0,NA,0,NA,NA,NA,NA,NA
1987,10,10,6,859,900,1134,1123,PS,1800,NA,155,143,NA,11,-1,SEA,LAX,954,NA,NA,0,NA,0,NA,NA,NA,NA,NA

...

With a datastore you can work with the data as tables and select the variables you need. For example, to get the mean of the arrival delays:

>> ds = datastore('airlinesmall.csv','TreatAsMissing','NA');
>> ds.MissingValue = 0;
>> ds.SelectedVariableNames = 'ArrDelay';
>> data = preview(ds)

data = 

    ArrDelay
    ________

     8      
     8      
    21      
    13      
     4      
    59      
     3      
    11      

>> data % this is a table

data = 

    ArrDelay
    ________

     8      
     8      
    21      
    13      
     4      
    59      
     3      
    11      

>> sums = [];
counts = [];
while hasdata(ds)
    T = read(ds); % this is a table, but this is not all loaded in memory

    sums(end+1) = sum(T.ArrDelay);
    counts(end+1) = length(T.ArrDelay);
end

>> avgArrivalDelay = sum(sums)/sum(counts)

avgArrivalDelay =

    6.9670

Let's work with your sample. Check this file:

sample.csv

TYPE OF FAILURE, START DATE, START TIME, DURATION, LOCALIZATION, WORKING TEAM, SHIFT
failure 1, 06/01/2017, 12/13/20, 300,  Area 1, A, morning
failure 2, 06/01/2017, 12/13/20, 300,  Area 1, A, night
failure 3, 06/01/2017, 12/13/20, 400,  Area 1, A, afternoon
failure 1, 08/01/2017, 12/13/20, 300,  Area 1, A, morning
failure 2, 09/01/2017, 12/13/20, 300,  Area 1, A, morning
failure 3, 09/01/2017, 12/13/20, 300,  Area 1, A, night
failure 3, 09/01/2017, 14/13/20, 200,  Area 1, A, morning
failure 1, 10/01/2017, 12/13/20, 300,  Area 1, A, morning
failure 1, 12/01/2017, 12/13/20, 300,  Area 1, A, afternoon
failure 2, 12/01/2017, 12/13/20, 500,  Area 1, A, morning
failure 1, 14/01/2017, 12/13/20, 300,  Area 1, A, night
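If the time of day matters as well as the date, the START DATE and START TIME columns could be merged into a single timestamp. The following is only a sketch: the two cell arrays are hypothetical stand-ins for columns read from the file, and the format string assumes the dd/mm/yyyy and hh/mm/ss layouts given in the question.

```matlab
% Hypothetical stand-ins for the STARTDATE and STARTTIME columns of one chunk.
startDate = {'06/01/2017'; '09/01/2017'};
startTime = {'12/13/20';   '14/13/20'};

% strcat strips trailing whitespace from char inputs, so the separator
% is passed as a cell ({' '}) to keep the space between date and time.
stamp = strcat(startDate, {' '}, startTime);

% Slashes are literal separators in the format; HH is the 24-hour clock.
ts = datetime(stamp, 'InputFormat', 'dd/MM/yyyy HH/mm/ss');

% Elapsed time between the two events, in hours.
gapHours = hours(ts(2) - ts(1));
```

With one datetime per row, gaps between events can be measured with sub-day resolution instead of whole days.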

You can see that failure 1 occurs every two days. Let's verify this:

>> ds = tabularTextDatastore('sample.csv')
Warning: Variable names were modified to make them valid MATLAB identifiers. 

ds = 

  TabularTextDatastore with properties:

                      Files: {
                             '/home/anquegi/learn/matlab/stackoverflow/sample.csv'
                             }
               FileEncoding: 'UTF-8'
          ReadVariableNames: true
              VariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: ''
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%q', '%q', '%q' ... and 4 more}
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}
            SelectedFormats: {'%q', '%q', '%q' ... and 4 more}
                   ReadSize: 20000 rows

>> ds.SelectedVariableNames = {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME', 'DURATION', 'LOCALIZATION', 'WORKINGTEAM', 'SHIFT'}

ds = 

  TabularTextDatastore with properties:

                      Files: {
                             '/home/anquegi/learn/matlab/stackoverflow/sample.csv'
                             }
               FileEncoding: 'UTF-8'
          ReadVariableNames: true
              VariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}

  Text Format Properties:
             NumHeaderLines: 0
                  Delimiter: ','
               RowDelimiter: '\r\n'
             TreatAsMissing: ''
               MissingValue: NaN

  Advanced Text Format Properties:
            TextscanFormats: {'%q', '%q', '%q' ... and 4 more}
         ExponentCharacters: 'eEdD'
               CommentStyle: ''
                 Whitespace: ' \b\t'
    MultipleDelimitersAsOne: false

  Properties that control the table returned by preview, read, readall:
      SelectedVariableNames: {'TYPEOFFAILURE', 'STARTDATE', 'STARTTIME' ... and 4 more}
            SelectedFormats: {'%q', '%q', '%q' ... and 4 more}
                   ReadSize: 20000 rows

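>> reset(ds)
starts = datetime.empty(0,1);
while hasdata(ds)
    T = read(ds);
    isF1 = strcmp(T.TYPEOFFAILURE,'failure 1');   % rows for this failure type
    % append, so dates from every chunk are kept (not only the last one)
    starts = [starts; datetime(T.STARTDATE(isF1), 'InputFormat','dd/MM/yyyy')];
end
mean(diff(starts))   % mean gap between consecutive occurrences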

ans = 

   48:00:00

% Exactly every 48 hours; from here you can try any other analysis you want
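For the F1->F2 pattern described in the question, one possible next step is to measure, for each failure 1, how many days pass until the next failure 2, and then count how often each lag occurs. This is a minimal sketch rather than a complete method: the dates are typed in from sample.csv instead of being read through the datastore, the variable names are illustrative, and "next" is assumed to mean "on the same day or later".

```matlab
% Dates of each failure type, copied from sample.csv (day precision only).
f1 = datetime({'06/01/2017'; '08/01/2017'; '10/01/2017'; ...
               '12/01/2017'; '14/01/2017'}, 'InputFormat', 'dd/MM/yyyy');
f2 = datetime({'06/01/2017'; '09/01/2017'; '12/01/2017'}, ...
              'InputFormat', 'dd/MM/yyyy');
f2 = sort(f2);   % the search below relies on chronological order

% For every failure 1, lag in days to the first failure 2 on or after it.
lagDays = nan(size(f1));
for k = 1:numel(f1)
    idx = find(f2 >= f1(k), 1, 'first');
    if ~isempty(idx)
        lagDays(k) = days(f2(idx) - f1(k));
    end
end

% Tabulate the lags; failure 1 events with no later failure 2 stay NaN
% and are left out of the counts.
validLags = lagDays(~isnan(lagDays));
[uniqueLags, ~, grp] = unique(validLags);
lagCounts = accumarray(grp, 1);
disp(table(uniqueLags, lagCounts))
```

On real data, a lag that clearly dominates the counts (such as 5 days in the question's example) would hint at a recurring F1->F2 sequence. The same loop can be repeated for other pairs of failure types.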

Answer 1, posted 2017-01-10 11:07:38