Merging CSV lines in huge file

I have a CSV that looks like this

783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-01 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:15,1,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:30,2,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:15,1,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582893T,2014-01-02 00:30,2,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y

although the full file contains 5 billion records. If you look at the first column and the date part of the 2nd column, three of the records are 'grouped' together: they are just a breakdown into 15-minute intervals for the first 30 minutes of that day.

I want the output to look like

783582893T,2014-01-01 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
783582855T,2014-01-01 00:00,0,128,35.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y
...
783582893T,2014-01-02 00:00,0,124,29.1,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y,40.0,0.0,40,40,5,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,40,Y

Where the first 4 columns of the repeating rows are omitted and the rest of the columns are combined with the first record of its kind. Basically, I am converting the data from one line per 15-minute interval to one line per day.

Since I will be processing 5 billion records, I think the best approach is to use regular expressions (in EmEditor) or some tool that is made for this (multithreaded, optimized), rather than a custom programmed solution. Although I am open to ideas in Node.js or C# that are relatively simple and very fast.

How can this be done?

If there's always a set number of records and they're in order, it'd be fairly easy to just read a few lines at a time, parse them, and output them. Trying to run a regex over billions of records would take forever. Using StreamReader and StreamWriter should make it possible to read and write these large files, since they read and write one line at a time.

using (StreamReader sr = new StreamReader("inputFile.txt"))
using (StreamWriter sw = new StreamWriter("outputFile.txt"))
{
    string line1;
    var lineCountToGroup = 3; // change to 96 (24 hours x 4 fifteen-minute intervals) for a full day
    while ((line1 = sr.ReadLine()) != null)
    {
        var lines = new List<string>();
        lines.Add(line1);
        for (int i = 0; i < lineCountToGroup - 1; i++) // less 1 because we already added line1
            lines.Add(sr.ReadLine());

        // grouping logic: keep the first line whole, then append each later line
        // minus its first 4 columns (requires System.Linq)
        var groupedLine = lines[0] + "," +
            string.Join(",", lines.Skip(1).Select(l => string.Join(",", l.Split(',').Skip(4))));
        sw.WriteLine(groupedLine);
    }
}

Disclaimer: untested code with no error handling, assuming that there are indeed the correct number of lines repeated, etc. You'd obviously need to make some tweaks for your exact scenario.

You could do something like this (untested code without any error handling - but should give you the general gist of it):

using (var sin = new StreamReader("yourfile.csv"))
using (var sout = new StreamWriter("outfile.csv"))
{
    var line = sin.ReadLine();    // note: should add error handling for empty files
    var cells = line.Split(',');  // note: you should probably check the length too!
    var key = cells[0];           // use this to match other rows
    StringBuilder output = new StringBuilder(line);   // this is the output line we build
    while ((line = sin.ReadLine()) != null) // while we have more lines
    {
        cells = line.Split(',');    // split so we can get the first column
        if (cells[0] == key)        // if the first column matches the current key
        {
            output.Append(',').Append(string.Join(",", cells.Skip(4)));   // add this row to our output line
            continue;               // keep accumulating until the key changes
        }
        // once the key changes
        sout.WriteLine(output.ToString());      // write out the line we've built up
        output.Clear();
        output.Append(line);         // start the new line to build
        key = cells[0];              // and update the key
    }
    // once all lines have been processed
    sout.WriteLine(output.ToString());    // we'll have just the last line to write out
}

The idea is to loop through each line in turn and keep track of the current value of the first column. When that value changes, you write out the output line you've been building up and update the key. This way you don't have to worry about exactly how many matches you have or whether you might be missing a few points.

One note: it might be more efficient to use a StringBuilder for the output rather than a String if you are going to concatenate 96 rows.
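To illustrate that note, here is a minimal sketch of building one merged line with a StringBuilder instead of repeated string concatenation. The helper name BuildGroupedLine is made up for illustration, not from either answer; it assumes the rows of one group have already been collected into a list, and it needs System.Linq and System.Text:

// Builds one output line from the rows of a single group: the first row is
// kept whole, each later row contributes everything after its first 4 columns.
static string BuildGroupedLine(IReadOnlyList<string> rows)
{
    var sb = new StringBuilder(rows[0]);                 // start from the full first row
    for (int i = 1; i < rows.Count; i++)                 // e.g. up to 95 more 15-minute rows
    {
        var cells = rows[i].Split(',');
        sb.Append(',').Append(string.Join(",", cells.Skip(4)));  // drop the first 4 columns
    }
    return sb.ToString();                                // one allocation at the end
}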

Define a ProcessOutputLine method to handle each merged line. Call ProcessInputLine after each ReadLine, and flush the last output line through ProcessOutputLine at end of file.

string curKey     = "";
int    keyLength  = ...;  // set to the total length of the first 4 columns
string outputLine = "";

private void ProcessInputLine(string line)
{
    string newKey = line.Substring(0, keyLength);
    if (newKey == curKey)
    {
        outputLine += line.Substring(keyLength);
    }
    else
    {
        if (outputLine != "") ProcessOutputLine(outputLine);
        curKey = newKey;
        outputLine = line;
    }
}
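For completeness, a minimal driver sketch of the loop described above; the Run method name, the file names, and the writer field are illustrative assumptions, not part of the original answer:

StreamWriter writer;   // assumed field, opened alongside the reader below

private void ProcessOutputLine(string mergedLine)
{
    writer.WriteLine(mergedLine);          // one merged line per group
}

private void Run()
{
    using (var reader = new StreamReader("inputFile.csv"))
    using (writer = new StreamWriter("outputFile.csv"))
    {
        string line;
        while ((line = reader.ReadLine()) != null)
            ProcessInputLine(line);        // accumulate, flushing whenever the key changes

        if (outputLine != "")
            ProcessOutputLine(outputLine); // flush the final group at end of file
    }
}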

EDIT: this solution is very similar to that of Matt Burland; the only noticeable difference is that I don't use the Split function.
