Merging CSV rows in a huge file

Question

I have a CSV that looks like this:

783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:15,1,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:30,2,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y

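Except that there are 5 billion records. If you look at the first column and part of the second column (the day), you will notice that three of the records are "grouped" together and are just a breakdown, in 15-minute intervals, of the first 30 minutes of that day.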

I want the output to look like this:

783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y

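The first 4 columns of the repeated rows are omitted, and the remaining columns are appended to the first record of their group. Basically, I am converting from one row per 15 minutes to one row per day.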
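To make the transformation concrete, here is a minimal C# sketch that merges one group of rows already held in memory. `MergeGroup` and its `skipColumns` parameter are hypothetical names, not something from the question, and the sketch assumes plain comma-separated fields with no quoted commas. The text above says 4 leading columns are dropped, while the sample output appears to drop 5, so the count is left as a parameter rather than hard-coded.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: keep the first row whole, then append every later row
// with its leading skipColumns fields removed.
// Assumes plain comma-separated fields with no quoted commas.
static string MergeGroup(IReadOnlyList<string> rows, int skipColumns)
{
    var parts = new List<string> { rows[0] };
    foreach (var row in rows.Skip(1))
        parts.AddRange(row.Split(',').Skip(skipColumns));
    return string.Join(",", parts);
}

// Shortened rows in the same shape as the question's data.
var sample = new[]
{
    "783582893T,2014-01-01 00:00,0,124,29.1,40.0,Y",
    "783582893T,2014-01-01 00:15,1,124,29.1,41.0,Y",
    "783582893T,2014-01-01 00:30,2,124,29.1,42.0,Y",
};

// prints 783582893T,2014-01-01 00:00,0,124,29.1,40.0,Y,29.1,41.0,Y,29.1,42.0,Y
Console.WriteLine(MergeGroup(sample, 4));
```

Passing 5 instead of 4 would also drop the `29.1` column from the appended rows, which is what the sample output in the question seems to show.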

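Since I will be processing 5 billion records, I think the best approach is to use regular expressions (and EmEditor) or some tool made for this (multithreaded, optimized) rather than a custom-programmed solution. That said, I am open to ideas in nodeJS or C# that are relatively simple and super quick.

How can this be done?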

Answer 1

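If there is always a fixed number of records per group and they are in order, it is fairly easy to read just a few lines at a time, parse them, and output them. Trying to run a regex over billions of records would take forever. StreamReader and StreamWriter make it possible to read and write files this large, because they read and write one line at a time.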

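// requires: using System.Collections.Generic; using System.IO; using System.Linq;
using (StreamReader sr = new StreamReader("inputFile.txt"))
using (StreamWriter sw = new StreamWriter("outputFile.txt"))
{
    string line1;
    var lineCountToGroup = 3; // change to 96
    while ((line1 = sr.ReadLine()) != null)
    {
        var lines = new List<string>();
        lines.Add(line1);
        for (int i = 0; i < lineCountToGroup - 1; i++) // less 1 because we already added line1
        {
            var next = sr.ReadLine();
            if (next == null) break; // the file ended in the middle of a group
            lines.Add(next);
        }

        // whatever your grouping logic is; here the first line is kept whole and
        // the others are appended without their first 4 columns
        var groupedLine = string.Join(",", lines.Take(1).Concat(
            lines.Skip(1).Select(l => string.Join(",", l.Split(',').Skip(4)))));
        sw.WriteLine(groupedLine);
    }
}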

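Disclaimer: untested code with no error handling, and it assumes that each group really does have the expected number of repeated lines, etc. You will obviously need to make some tweaks for your exact scenario.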

Answer 2

You can do something like this (untested code without any error handling, but it should give you the general gist):

// requires: using System.IO; using System.Linq; using System.Text;
using (var sin = new StreamReader("yourfile.csv"))
using (var sout = new StreamWriter("outfile.csv"))
{
    var line = sin.ReadLine();    // note: should add error handling for empty files
    var cells = line.Split(',');  // note: you should probably check the length too!
    var key = cells[0];           // use this to match other rows
    StringBuilder output = new StringBuilder(line);   // this is the output line we build
    while ((line = sin.ReadLine()) != null) // if we have more lines
    {
        cells = line.Split(',');    // split so we can get the first column
        if (cells[0] == key)        // if the first column matches the current key
        {
            // add this row, minus its first 4 columns, to our output line
            output.Append(',').Append(String.Join(",", cells.Skip(4)));
        }
        else                        // once the key changes
        {
            sout.WriteLine(output.ToString());  // write out the line we've built up
            output.Clear();
            output.Append(line);    // start building the new line
            key = cells[0];         // and update the key
        }
    }
    // once all lines have been processed
    sout.WriteLine(output.ToString());    // We'll have just the last line to write out
}

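The idea is to walk through each line in turn and keep track of the current value of the first column. When that value changes, you write out the output line you have been building and update the key. This way you do not need to worry about exactly how many matches you have, or whether you might be missing a few points.

It is worth noting that if you are concatenating 96 rows, using a StringBuilder for output instead of a String is probably more efficient.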
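The loop above keys on the first column alone, while the question describes a group as the first column plus the day part of the second. As a hedged variation, the sketch below keys on both. `GroupKey` and `MergeByIdAndDay` are hypothetical names; the sketch assumes that every row has at least two columns, that the timestamp looks like `2014-01-01 00:15`, and that the rows of a group are contiguous in the file. It works on TextReader/TextWriter so it can be tried on a string before pointing it at a StreamReader and StreamWriter.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

// Hypothetical key: the ID plus the date part of the timestamp.
// Assumes column 2 looks like "2014-01-01 00:15" (date, space, time).
static string GroupKey(string[] cells) => cells[0] + "," + cells[1].Split(' ')[0];

// Streams rows from input to output, merging contiguous rows that share a key.
static void MergeByIdAndDay(TextReader input, TextWriter output, int skipColumns)
{
    bool started = false;              // becomes true once the first row is seen
    string key = "";
    var current = new StringBuilder();
    for (var line = input.ReadLine(); line != null; line = input.ReadLine())
    {
        if (line.Length == 0) continue;            // ignore blank lines
        var cells = line.Split(',');
        var lineKey = GroupKey(cells);
        if (started && lineKey == key)
        {
            // same group: append the row without its leading columns
            current.Append(',').Append(string.Join(",", cells.Skip(skipColumns)));
        }
        else
        {
            if (started) output.WriteLine(current.ToString()); // flush previous group
            current.Clear().Append(line);
            key = lineKey;
            started = true;
        }
    }
    if (started) output.WriteLine(current.ToString());         // flush the last group
}

// Two rows for one day, then the same ID on the next day.
var reader = new StringReader(
    "A,2014-01-01 00:00,0,124,29.1,Y\n" +
    "A,2014-01-01 00:15,1,124,29.1,Y\n" +
    "A,2014-01-02 00:00,0,124,29.1,Y\n");
var writer = new StringWriter();
MergeByIdAndDay(reader, writer, 4);
Console.Write(writer.ToString());
```

With this key, the same ID on a new day starts a new output row even when no other ID's rows sit in between.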

Answer 3

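Define ProcessOutputLine to store a merged line. Call ProcessInputLine after each ReadLine, and at the end of the file pass whatever is left in outputLine to ProcessOutputLine.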

string curKey     = "";
int    keyLength  = ... ; // set to the total length of the first 4 columns
string outputLine = "";

private void ProcessInputLine(string line)
{
  // rows are merged only while their first keyLength characters are identical
  string newKey = line.Substring(0, keyLength);
  if (newKey == curKey) outputLine += line.Substring(keyLength);
  else
  {
    if (outputLine != "") ProcessOutputLine(outputLine);
    curKey = newKey;
    outputLine = line;
  }
}

Edit: this solution is very similar to Matt Burland's; the only notable difference is that I do not use the Split function.
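One caveat when avoiding Split: a fixed character count only works if the leading columns always have the same width, and in the sample data the third column changes from row to row. A hedged alternative that still avoids Split is to locate the fourth comma with IndexOf. `IndexOfNthComma` is a hypothetical helper, and the sketch again assumes there are no quoted commas.

```csharp
using System;

// Hypothetical helper: index of the n-th comma in line (n is 1-based),
// or -1 if the line contains fewer than n commas.
static int IndexOfNthComma(string line, int n)
{
    int pos = -1;
    for (int i = 0; i < n; i++)
    {
        pos = line.IndexOf(',', pos + 1);   // search after the previous comma
        if (pos < 0) return -1;
    }
    return pos;
}

var row = "783582893T,2014-01-01 00:15,1,124,29.1,40.0,Y";
int cut = IndexOfNthComma(row, 4);          // the comma that ends the 4th column
Console.WriteLine(row.Substring(cut));      // prints ",29.1,40.0,Y"
```

Because the returned substring starts at the comma itself, it can be appended to the line being built without adding a separator.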

3 answers

Answer 1
2 accepted 2015-07-09 13:31:50

Answer 2
1 2015-07-09 13:42:52

Answer 3
0 2015-07-09 13:45:04