
Merging CSV lines in huge file

I have a CSV that looks like this:

783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:15,1,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:30,2,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y

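There are 5 billion records, though. If you look at the first column and part of the second column (the day), you will notice that each set of three records is "grouped" together and is just a breakdown, in 15-minute intervals, of the first 30 minutes of that day.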

I want the output to look like this:

783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y

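The first 4 columns of the repeated rows are omitted, and the remaining columns are appended to the first record of the same kind. Basically, I am converting the file from one line per 15 minutes to one line per day.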

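Since I will be processing 5 billion records, I think the best approach is to use regular expressions (and EmEditor) or some tool made for this (multithreaded, optimized) rather than a custom programmed solution. That said, I am open to ideas in nodeJS or C# that are relatively simple and super quick.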

How can this be done?

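If there is always a fixed number of records per group and they are in order, it is fairly easy to read a few lines at a time, parse them, and write them out. Trying to run a regex over billions of records would take forever. StreamReader and StreamWriter can handle files this large because they read and write one line at a time.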

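using System.Collections.Generic;
using System.IO;
using System.Linq;

using (StreamReader sr = new StreamReader("inputFile.txt"))
using (StreamWriter sw = new StreamWriter("outputFile.txt"))
{
    string line1;
    var lineCountToGroup = 3; // change to 96 for a full day of 15-minute rows
    while ((line1 = sr.ReadLine()) != null)
    {
        var lines = new List<string>();
        lines.Add(line1);
        for (int i = 0; i < lineCountToGroup - 1; i++) // less 1 because we already added line1
        {
            var next = sr.ReadLine();
            if (next == null) break; // the file ended in the middle of a group
            lines.Add(next);
        }

        // Grouping logic (adjust to your needs): keep the first row whole and
        // append every later row without its first 4 columns.
        var groupedLine = string.Join(",",
            lines.Take(1).Concat(
                lines.Skip(1).Select(l => string.Join(",", l.Split(',').Skip(4)))));
        sw.WriteLine(groupedLine);
    }
}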

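Disclaimer: this is untested code with no error handling, and it assumes that every group really has the same number of rows. You will obviously need to adjust it for your exact scenario.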
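The fixed-count approach above depends on every batch really being one group. A minimal sketch of a check for that assumption follows; `GroupCheck`, `DayKey`, and `IsCompleteGroup` are hypothetical names, and it assumes a group is identified by the ID in column 1 plus the date part of column 2, as in the sample data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupCheck
{
    // "ID,date" for a row, e.g. "783582893T,2014-01-01".
    // Assumes column 2 looks like "yyyy-MM-dd HH:mm".
    public static string DayKey(string row)
    {
        var cells = row.Split(',');
        if (cells.Length < 2) return row; // malformed row: use it as its own key
        return cells[0] + "," + cells[1].Split(' ')[0];
    }

    // True when the batch has the expected size and all rows share one key.
    public static bool IsCompleteGroup(IReadOnlyList<string> rows, int expectedCount)
    {
        if (rows.Count == 0 || rows.Count != expectedCount) return false;
        var key = DayKey(rows[0]);
        return rows.All(r => DayKey(r) == key);
    }
}
```

Calling `GroupCheck.IsCompleteGroup(lines, lineCountToGroup)` before writing each merged line would let the loop stop, or log the offending batch, as soon as the input drifts out of step.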

You could do something like this (untested code without any error handling, but it should give you the general gist):

using System;
using System.IO;
using System.Linq;
using System.Text;

using (var sin = new StreamReader("yourfile.csv"))
using (var sout = new StreamWriter("outfile.csv"))
{
    var line = sin.ReadLine();    // note: should add error handling for empty files
    var cells = line.Split(',');  // note: you should probably check the length too!
    var key = cells[0];           // use this to match other rows
                                  // (if the same ID can appear on back-to-back days,
                                  // include the date part of column 2 in the key)
    StringBuilder output = new StringBuilder(line);   // this is the output line we build
    while ((line = sin.ReadLine()) != null) // if we have more lines
    {
        cells = line.Split(',');    // split so we can get the first column
        if (cells[0] == key)        // if the first column matches the current key
        {
            // add this row, minus its first 4 columns, to our output line
            output.Append(',').Append(String.Join(",", cells.Skip(4)));
            continue;
        }
        // once the key changes
        sout.WriteLine(output.ToString());      // write out the line we've built up
        output.Clear();
        output.Append(line);         // start building the new line
        key = cells[0];              // and update the key
    }
    // once all lines have been processed
    sout.WriteLine(output.ToString());    // We'll have just the last line to write out
}

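The idea is to walk through the lines one at a time and keep track of the current value of the first column. When that value changes, you write out the output line you have been building and update the key. This way you do not have to worry about exactly how many matching rows there are, or whether a few data points are missing.

Note that if you are concatenating 96 rows, using a StringBuilder for output rather than a String is likely to be more efficient.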
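The key-change idea can also be written against TextReader and TextWriter, so that it can be tried on a small in-memory sample before it is pointed at the real file. This is only a sketch: `CsvDayMerger`, `MergeByKey`, and `skipColumns` are hypothetical names, and the key here is the ID plus the date part of column 2, which is an assumption that goes beyond the first-column key described above. The number of columns to drop is a parameter because the question says 4, while its sample output appears to drop 5.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Text;

public static class CsvDayMerger
{
    // Group key: ID (column 1) plus the date part of the timestamp (column 2).
    static string KeyOf(string[] cells) => cells[0] + "," + cells[1].Split(' ')[0];

    // Streams input to output, merging adjacent rows that share a key.
    // Assumes rows of the same group are adjacent in the input.
    public static void MergeByKey(TextReader input, TextWriter output, int skipColumns)
    {
        string currentKey = null;
        var merged = new StringBuilder();
        string line;
        while ((line = input.ReadLine()) != null)
        {
            if (line.Length == 0) continue;            // ignore blank lines
            var cells = line.Split(',');
            if (cells.Length < 2)
                throw new InvalidDataException("Unexpected row: " + line);

            var key = KeyOf(cells);
            if (key == currentKey)
            {
                // Same group: append the row without its leading columns.
                merged.Append(',').Append(string.Join(",", cells.Skip(skipColumns)));
            }
            else
            {
                // Key changed: flush the finished group and start a new one.
                if (currentKey != null) output.WriteLine(merged.ToString());
                merged.Clear().Append(line);
                currentKey = key;
            }
        }
        // Flush the last group, if the input was not empty.
        if (currentKey != null) output.WriteLine(merged.ToString());
    }
}
```

For the real file, the same method can be called with a StreamReader and a StreamWriter, for example `CsvDayMerger.MergeByKey(new StreamReader("yourfile.csv"), writer, 4)`. Only one merged line is held in memory at a time, so the size of the input should not matter.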

Define ProcessOutputLine to store the merged line. Call ProcessInputLine after each ReadLine and once more at the end of the file.

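string curKey     = ""  ;
int    keyLength  = ... ; // set to the total length of the first 4 columns
                          // (assumes this prefix is fixed-width and identical for every row of a group)
string outputLine = ""  ;

// Call with each line read; call once with null at end of file to flush the last merged line.
private void ProcessInputLine(string line)
{
  string newKey = line?.Substring(0, keyLength);
  if (newKey != null && newKey == curKey) outputLine += line.Substring(keyLength);
  else
  {
    if (outputLine != "") ProcessOutputLine(outputLine);
    curKey = newKey;
    outputLine = line ?? "";
  }
}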

Edit: this solution is very similar to Matt Burland's; the only notable difference is that I do not use the Split function.
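If the leading columns are not fixed-width (in the sample, the third column would grow from one digit to two once it passes 9), the prefix lengths can be found with IndexOf instead, still without calling Split. A minimal sketch, with hypothetical names `PrefixKey`, `KeyLength`, and `OffsetAfterColumns`, assuming the timestamp has the form "date time" separated by a single space:

```csharp
using System;

public static class PrefixKey
{
    // Length of the "ID,date" prefix: everything before the first space
    // that follows the first comma. Returns -1 if the row does not fit.
    public static int KeyLength(string line)
    {
        int comma = line.IndexOf(',');
        if (comma < 0) return -1;
        return line.IndexOf(' ', comma + 1);
    }

    // Index of the first character after the first n columns
    // (just past the n-th comma), or -1 if the row has fewer commas.
    public static int OffsetAfterColumns(string line, int n)
    {
        int pos = -1;
        for (int i = 0; i < n; i++)
        {
            pos = line.IndexOf(',', pos + 1);
            if (pos < 0) return -1;
        }
        return pos + 1;
    }
}
```

With these, `line.Substring(0, PrefixKey.KeyLength(line))` gives the value to compare against the current key, and `line.Substring(PrefixKey.OffsetAfterColumns(line, 4))` gives the part of a repeated row to append; a separating comma still has to be added between the merged line and that part.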
