
Parsing a huge text file (around 2GB) with custom delimiters

I have a huge text file, around 2GB, which I am trying to parse in C#. The file has custom delimiters for rows and columns. I want to parse the file, extract the data, and write it to another file, inserting a column header, replacing the RowDelimiter with a newline, and replacing the ColumnDelimiter with a tab, so that I get the data in tabular format.

sample data:
1'~'2'~'3#####11'~'12'~'13

RowDelimiter: #####
ColumnDelimiter: '~'
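
Expected output for the sample above (with a hypothetical header row of col1/col2/col3, tab-separated):

col1	col2	col3
1	2	3
11	12	13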

I keep on getting System.OutOfMemoryException on the following line

while ((line = rdr.ReadLine()) != null)

public void ParseFile(string inputfile,string outputfile,string header)
{

    using (StreamReader rdr = new StreamReader(inputfile))
    {
        string line;

        while ((line = rdr.ReadLine()) != null)
        {
            using (StreamWriter sw = new StreamWriter(outputfile))
            {
                //Write the Header row
                sw.Write(header);

                //parse the file
                string[] rows = line.Split(new string[] { ParserConstants.RowSeparator },
                    StringSplitOptions.None);

                foreach (string row in rows)
                {
                    string[] columns = row.Split(new string[] {ParserConstants.ColumnSeparator},
                        StringSplitOptions.None);
                    foreach (string column in columns)
                    {
                        sw.Write(column + "\t");
                    }
                    sw.Write(ParserConstants.NewlineCharacter);
                    Console.WriteLine();
                }
            }

            Console.WriteLine("File Parsing completed");

        }
    }
}
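
(ParserConstants is not shown in the question; judging from the delimiters above, it is presumably something like this hypothetical definition:)

public static class ParserConstants
{
    public const string RowSeparator = "#####";
    public const string ColumnSeparator = "'~'";
    public const string NewlineCharacter = "\n";
}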

Read the data into a buffer and then do your parsing.

using (StreamReader rdr = new StreamReader(inputfile))
using (StreamWriter sw = new StreamWriter(outputfile))
{
    char[] buffer = new char[256];
    int read;

    //Write the Header row
    sw.Write(header);

    string remainder = string.Empty;
    while ((read = rdr.Read(buffer, 0, 256)) > 0)
    {
        string bufferData = new string(buffer, 0, read);
        // Prepend the leftover from the previous chunk before splitting, so a
        // row separator that straddles a chunk boundary is still found.
        string[] rows = (remainder + bufferData).Split(
            new string[] { ParserConstants.RowSeparator },
            StringSplitOptions.None);

        // The last element may be an incomplete row; carry it into the next chunk.
        int completeRows = rows.Length - 1;
        remainder = rows[completeRows];
        foreach (string row in rows.Take(completeRows)) // Take requires System.Linq
        {
            string[] columns = row.Split(
                new string[] {ParserConstants.ColumnSeparator},
                StringSplitOptions.None);
            foreach (string column in columns)
            {
                sw.Write(column + "\t");
            }
            sw.Write(ParserConstants.NewlineCharacter);
            Console.WriteLine();
        }
    }

    if (remainder.Length > 0)
    {
        string[] columns = remainder.Split(
            new string[] {ParserConstants.ColumnSeparator},
            StringSplitOptions.None);
        foreach (string column in columns)
        {
            sw.Write(column + "\t");
        }
        sw.Write(ParserConstants.NewlineCharacter);
        Console.WriteLine();
    }

    Console.WriteLine("File Parsing completed");
}

The problem you have is that you are eagerly consuming the whole file and placing it in memory: since the file contains no standard line breaks, ReadLine keeps buffering until it reaches the end of the file, so the first "line" is the entire 2GB. Attempting to hold and split a 2GB string in memory is going to be problematic, as you now know.

Solution? Consume one line at a time. Because your file doesn't have a standard line separator, you'll have to implement a custom parser that does this for you. The following code does just that (or I think it does, I haven't tested it). It's probably very improvable from a performance perspective, but it should at least get you started in the right direction (C# 7 syntax):

public static IEnumerable<string> GetRows(string path, string rowSeparator)
{
    // C# 7 local function: read rowSeparator.Length chars and report whether
    // they form the separator, plus how many chars were actually consumed.
    (bool matched, int count) tryParseSeparator(StreamReader reader, char[] buffer)
    {
        // ReadBlock keeps reading until the buffer is full or the stream ends;
        // Read may return fewer chars even when more data remains.
        var count = reader.ReadBlock(buffer, 0, buffer.Length);
        return (count == buffer.Length && Enumerable.SequenceEqual(buffer, rowSeparator), count);
    }

    using (var reader = new StreamReader(path))
    {
        int peeked;
        var rowBuffer = new StringBuilder();
        var separatorBuffer = new char[rowSeparator.Length];

        while ((peeked = reader.Peek()) > -1)
        {
            if ((char)peeked == rowSeparator[0])
            {
                var (matched, count) = tryParseSeparator(reader, separatorBuffer);

                if (matched)
                {
                    yield return rowBuffer.ToString();
                    rowBuffer.Clear();
                }
                else
                {
                    // Append only the chars actually read, not any stale tail left
                    // in the buffer by an earlier read. (A failed match can still
                    // swallow the start of a real separator; the matchCount-based
                    // parser further down handles that case.)
                    rowBuffer.Append(separatorBuffer, 0, count);
                }
            }
            else
            {
                rowBuffer.Append((char)reader.Read());
            }
        }

        if (rowBuffer.Length > 0)
            yield return rowBuffer.ToString();
    }
}

Now you have a lazy enumeration of rows from your file, and you can process it as you intended to:

foreach (var row in GetRows(inputFile, ParserConstants.RowSeparator))
{
     var columns = row.Split(new string[] {ParserConstants.ColumnSeparator},
                             StringSplitOptions.None);
     //etc.
}

As mentioned already in the comments, you won't be able to use ReadLine to handle this; you'll essentially have to process the data one byte - or character - at a time. The good news is that this is basically how ReadLine works anyway, so we're not losing a lot in this case.

Using a StreamReader we can read a series of characters from the source stream (in whatever encoding you need) into an array. Using that and a StringBuilder we can process the stream in chunks and check for separator sequences on the way.

Here's a method that will handle an arbitrary delimiter:

public static IEnumerable<string> ReadDelimitedRows(StreamReader reader, string delimiter)
{
    char[] delimChars = delimiter.ToArray();
    int matchCount = 0;
    char[] buffer = new char[512];
    int rc = 0;
    StringBuilder sb = new StringBuilder();

    while ((rc = reader.Read(buffer, 0, buffer.Length)) > 0)
    {
        for (int i = 0; i < rc; i++)
        {
            char c = buffer[i];
            if (c == delimChars[matchCount])
            {
                if (++matchCount >= delimChars.Length)
                {
                    // found full row delimiter
                    yield return sb.ToString();
                    sb.Clear();
                    matchCount = 0;
                }
            }
            else
            {
                if (matchCount > 0)
                {
                    // append the previously matched portion of the delimiter;
                    // Append(char[], int, int) appends the characters, whereas
                    // Append(delimChars.Take(...)) would bind to Append(object)
                    // and write the enumerator's type name instead
                    sb.Append(delimChars, 0, matchCount);
                    matchCount = 0;
                }
                // the current char may itself start a new delimiter match
                if (c == delimChars[0])
                    matchCount = 1;
                else
                    sb.Append(c);
            }
        }
    }
    // return the last row if found
    if (sb.Length > 0)
        yield return sb.ToString();
}

This also handles cases where part of your block delimiter appears in the actual data. (For delimiters whose own prefixes overlap themselves, a full KMP-style fallback would be needed, but the simple re-test above is correct for the ##### and '~' delimiters used here.)
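
As a quick sanity check (a hypothetical snippet, not part of the original answer), you can feed it an in-memory string containing partial delimiters and confirm the rows come out intact:

// needs using System.IO; using System.Text;
var data = "a##b'~c#####d'~'e#####";
using (var reader = new StreamReader(new MemoryStream(Encoding.UTF8.GetBytes(data))))
{
    foreach (var row in ReadDelimitedRows(reader, "#####"))
        Console.WriteLine(row); // prints "a##b'~c", then "d'~'e"
}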

In order to translate your file from the input format you describe to a simple tab-delimited format you could do something along these lines:

const string RowDelimiter = "#####";
const string ColumnDelimiter = "'~'";

using (var reader = new StreamReader(inputFilename))
using (var writer = new StreamWriter(File.Create(outputFilename)))
{
    foreach (var row in ReadDelimitedRows(reader, RowDelimiter))
    {
        writer.WriteLine(row.Replace(ColumnDelimiter, "\t")); // WriteLine, so rows don't run together
    }
}

That should process fairly quickly without eating up too much memory. Some adjustments might be required for non-ASCII output.
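
For example (a sketch assuming UTF-8; substitute whatever encoding your data actually uses), you can pass an explicit encoding to both streams:

using (var reader = new StreamReader(inputFilename, Encoding.UTF8))
using (var writer = new StreamWriter(File.Create(outputFilename), Encoding.UTF8))
{
    // ... same row loop as above
}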

I think this should do the trick...

public void ParseFile(string inputfile, string outputfile, string header)
{
    int blockSize = 1024;

    using (var file = File.OpenRead(inputfile))
    {
        using (StreamWriter sw = new StreamWriter(outputfile))
        {
            // Write the header row
            sw.Write(header);

            int bytes = 0;
            int parsedBytes = 0;
            var buffer = new byte[blockSize];
            string lastRow = string.Empty;

            while ((bytes = file.Read(buffer, 0, buffer.Length)) > 0)
            {
                // Because the buffer edge could split a RowDelimiter, we need to keep the
                // last row from the prior split operation.  Append the new buffer to the
                // last row from the prior loop iteration.
                // Note: decoding chunk-by-chunk assumes no character is split
                // across a chunk boundary (safe for single-byte encodings).
                lastRow += Encoding.Default.GetString(buffer, 0, bytes);

                //parse the file
                string[] rows = lastRow.Split(new string[] { ParserConstants.RowSeparator }, StringSplitOptions.None);

                // We cannot process the last row in this set because it may not be a
                // complete row, and tokens could be clipped.
                for (int i = 0; i < rows.Length - 1; i++)
                {
                    // string.Replace treats the separator literally, so no regex
                    // escaping is needed (and no Regex is allocated per row)
                    sw.Write(rows[i].Replace(ParserConstants.ColumnSeparator, "\t") + ParserConstants.NewlineCharacter);
                }
                lastRow = rows[rows.Length - 1];
                parsedBytes += bytes;
                // This count slightly overstates progress: lastRow hasn't been parsed yet.
                Console.WriteLine($"Parsed {parsedBytes:N0} bytes");
            }
            // Now that there are no more bytes to read, we know that the lastrow is complete.
            sw.Write(lastRow.Replace(ParserConstants.ColumnSeparator, "\t"));
        }
    }
    Console.WriteLine("File Parsing completed.");
}
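
A hypothetical invocation (assuming the ParserConstants sketched under the question); note that header is written verbatim, so it should carry its own trailing newline:

ParseFile("input.dat", "output.tsv", "col1\tcol2\tcol3\n");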

Late to the party here, but in case anyone else wants an easy way to load such a large file with custom delimiters, Cinchoo ETL does the job for you.

using (var parser = new ChoCSVReader("CustomNewLine.csv")
    .WithDelimiter("~")
    .WithEOLDelimiter("#####")
    )
{
    foreach (dynamic x in parser)
        Console.WriteLine(x.DumpAsJson());
}

Disclaimer: I'm the author of this library.
