
Parsing and uploading >1GB of data in C#

I have written a program to parse and upload a huge amount of data to a database. The problem is that the parsing is WAY too slow. My program has a Parser class that parses each file (using parallelisation) and raises an event for each entry it parses:

Parallel.ForEach<FileInfo>(
    files,
    new ParallelOptions { MaxDegreeOfParallelism = maxParallelism },
    (inputFile, args) =>
    {
        // Using underlying FileStream to allow concurrent Read/Write access.
        using (var input = new StreamReader(inputFile.FullName))
        {
            while (!input.EndOfStream)
            {
                RaiseEntryParsed(ParseCity(input.ReadLine()));
            }
            ParsedFiles++;
            RaiseFileParsed(inputFile);
        }
    });
RaiseDirectoryParsed(Directory);

The "main" program is subscribed to this event, and adds the entries to a DataTable to do a SqlBulkCopy; the SqlBulkCopy only submits when the parser class raises the FileParsed event (every time a file is parsed):

using (SqlBulkCopy bulkCopy = new SqlBulkCopy(_connectionString))
{
    DataTable cityTable = DataContext.CreateCityDataTable();
    parser.EntryParsed +=
        (s, e) =>
        {
            DataRow cityRow = cityTable.NewRow();
            City parsedCity = (City)e.DatabaseEntry;

            cityRow["id"] = parsedCity.Id;
            ...
            ...

            cityTable.Rows.Add(cityRow);
        };

    parser.FileParsed +=
        (s, e) =>
        {
            bulkCopy.WriteToServer(cityTable);
            Dispatcher.BeginInvoke((Action)UpdateProgress);
            cityTable.Rows.Clear();
        };

    parser.DirectoryParsed +=
        (s, e) =>
        {
            bulkCopy.WriteToServer(cityTable);
            Dispatcher.BeginInvoke((Action)UpdateProgress);
        };

    parser.BeginParsing();
}

The table's rows are cleared after each submission to conserve memory and to prevent an OutOfMemoryException from holding too many entities in memory at once...

How can I make this faster? It is currently unacceptably slow. I profiled the application, and it showed that most of the time is spent in the EntryParsed event. Thanks

I made a short test project and tried out a few different approaches. My goal was to build a DataTable with 27 columns (id, A, B, C, ..., Z) and about 300,000 rows as quickly as possible, using just sequential code.

(Each row is populated with an id, and the rest of the columns are filled with random 5-letter words.)

On my fourth attempt, I stumbled upon a different syntax for adding the row to the table based on an array of values of type Object (see the DataRowCollection.Add(params Object[] values) overload).

In your case it would be something like:

cityTable.Rows.Add( new Object[] {
    ((City)e.DatabaseEntry).Id,
    ObjectThatGoesInColumn2,
    ObjectThatGoesInColumn3,
    ObjectThatGoesInLastColumn
} );

instead of:

DataRow row = cityTable.NewRow();

row[0] = 100;
row["City Name"] = "Anaheim";
row["Column 7"] = ...
...
row["Column 26"] = checksum;

cityTable.Rows.Add( row );

This will give you a speed-up, since you won't be setting each column individually one at a time; based on your profiler screenshot, you have at least 12 columns that you were setting individually.

This also saves it from hashing the column name strings to see which array position you are dealing with, and then double checking that the data type is correct.
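
If you do still need to set some columns individually, a related trick (my addition, not part of the test project below) is to index by a cached DataColumn reference or by ordinal instead of by name, which skips the per-assignment name lookup:

// Cache the DataColumn once, outside the per-row work.
DataColumn idColumn = cityTable.Columns["id"];

DataRow cityRow = cityTable.NewRow();
cityRow[idColumn] = parsedCity.Id; // no string hashing on each assignment
cityTable.Rows.Add(cityRow);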

In case you are interested, here is my test project:

class Program
{
    public static System.Data.DataSet dataSet;
    public static System.Data.DataSet dataSet2;
    public static System.Data.DataSet dataSet3;
    public static System.Data.DataSet dataSet4;

    public static Random rand = new Random();

    public static int NumOfRows = 300000;

    static void Main(string[] args)
    {

        #region test1

        Console.WriteLine("Starting");

        Console.WriteLine("");

        Stopwatch watch = new Stopwatch();

        watch.Start();

        MakeTable();

        watch.Stop();

        Console.WriteLine("Elapsed Time was: " + watch.ElapsedMilliseconds + " milliseconds.");
        dataSet = null;

        Console.WriteLine("");

        Console.WriteLine("Completed.");

        Console.WriteLine("");

        #endregion

        /*

        #region test2


        Console.WriteLine("Starting Test 2");

        Console.WriteLine("");

        watch.Reset();

        watch.Start();

        MakeTable2();

        watch.Stop();

        Console.WriteLine("Elapsed Time was: " + watch.ElapsedMilliseconds + " milliseconds.");
        dataSet2 = null;

        Console.WriteLine("");

        Console.WriteLine("Completed Test 2.");

        #endregion


        #region test3
        Console.WriteLine("");

        Console.WriteLine("Starting Test 3");

        Console.WriteLine("");

        watch.Reset();

        watch.Start();

        MakeTable3();

        watch.Stop();

        Console.WriteLine("Elapsed Time was: " + watch.ElapsedMilliseconds + " milliseconds.");
        dataSet3 = null;

        Console.WriteLine("");

        Console.WriteLine("Completed Test 3.");

        #endregion

         */ 

        #region test4
        Console.WriteLine("Starting Test 4");

        Console.WriteLine("");

        watch.Reset();

        watch.Start();

        MakeTable4();

        watch.Stop();

        Console.WriteLine("Elapsed Time was: " + watch.ElapsedMilliseconds + " milliseconds.");
        dataSet4 = null;

        Console.WriteLine("");

        Console.WriteLine("Completed Test 4.");

        #endregion


        //printTable();

        Console.WriteLine("");
        Console.WriteLine("Press Enter to Exit...");

        Console.ReadLine();
    }

    private static void MakeTable()
    {
        DataTable table = new DataTable("Table 1");

        DataColumn column;
        DataRow row;

        column = new DataColumn();
        column.DataType = System.Type.GetType("System.Int32");
        column.ColumnName = "id";
        column.ReadOnly = true;
        column.Unique = true;

        table.Columns.Add(column);


        for (int i = 65; i <= 90; i++)
        {
            column = new DataColumn();
            column.DataType = System.Type.GetType("System.String");
            column.ColumnName = "5-Letter Word " + (char)i;
            column.AutoIncrement = false;
            column.Caption = "Random Word " + (char)i;
            column.ReadOnly = false;
            column.Unique = false;
            // Add the column to the table.
            table.Columns.Add(column);
        }

        DataColumn[] PrimaryKeyColumns = new DataColumn[1];
        PrimaryKeyColumns[0] = table.Columns["id"];
        table.PrimaryKey = PrimaryKeyColumns;

        // Instantiate the DataSet variable.
        dataSet = new DataSet();
        // Add the new DataTable to the DataSet.
        dataSet.Tables.Add(table);

        // Create NumOfRows new DataRow objects and add
        // them to the DataTable
        for (int i = 0; i < NumOfRows; i++)
        {
            row = table.NewRow();
            row["id"] = i;

            for (int j = 65; j <= 90; j++)
            {
                row["5-Letter Word " + (char)j] = getRandomWord();
            }

            table.Rows.Add(row);
        }

    }

    private static void MakeTable2()
    {
        DataTable table = new DataTable("Table 2");

        DataColumn column;
        DataRow row;

        column = new DataColumn();
        column.DataType = System.Type.GetType("System.Int32");
        column.ColumnName = "id";
        column.ReadOnly = true;
        column.Unique = true;

        table.Columns.Add(column);


        for (int i = 65; i <= 90; i++)
        {
            column = new DataColumn();
            column.DataType = System.Type.GetType("System.String");
            column.ColumnName = "5-Letter Word " + (char)i;
            column.AutoIncrement = false;
            column.Caption = "Random Word " + (char)i;
            column.ReadOnly = false;
            column.Unique = false;
            // Add the column to the table.
            table.Columns.Add(column);
        }

        DataColumn[] PrimaryKeyColumns = new DataColumn[1];
        PrimaryKeyColumns[0] = table.Columns["id"];
        table.PrimaryKey = PrimaryKeyColumns;

        // Instantiate the DataSet variable.
        dataSet2 = new DataSet();
        // Add the new DataTable to the DataSet.
        dataSet2.Tables.Add(table);

        // Create NumOfRows new DataRow objects and add
        // them to the DataTable
        for (int i = 0; i < NumOfRows; i++)
        {
            row = table.NewRow();

            row.BeginEdit();

            row["id"] = i;

            for (int j = 65; j <= 90; j++)
            {
                row["5-Letter Word " + (char)j] = getRandomWord();
            }

            row.EndEdit();

            table.Rows.Add(row);
        }

    }

    private static void MakeTable3()
    {
        DataTable table = new DataTable("Table 3");

        DataColumn column;

        column = new DataColumn();
        column.DataType = System.Type.GetType("System.Int32");
        column.ColumnName = "id";
        column.ReadOnly = true;
        column.Unique = true;

        table.Columns.Add(column);


        for (int i = 65; i <= 90; i++)
        {
            column = new DataColumn();
            column.DataType = System.Type.GetType("System.String");
            column.ColumnName = "5-Letter Word " + (char)i;
            column.AutoIncrement = false;
            column.Caption = "Random Word " + (char)i;
            column.ReadOnly = false;
            column.Unique = false;
            // Add the column to the table.
            table.Columns.Add(column);
        }

        DataColumn[] PrimaryKeyColumns = new DataColumn[1];
        PrimaryKeyColumns[0] = table.Columns["id"];
        table.PrimaryKey = PrimaryKeyColumns;

        // Instantiate the DataSet variable.
        dataSet3 = new DataSet();
        // Add the new DataTable to the DataSet.
        dataSet3.Tables.Add(table);


        DataRow[] newRows = new DataRow[NumOfRows];

        for (int i = 0; i < NumOfRows; i++)
        {
            newRows[i] = table.NewRow();
        }

        // Populate the pre-created DataRow objects and add
        // them to the DataTable
        for (int i = 0; i < NumOfRows; i++)
        {

            newRows[i]["id"] = i;

            for (int j = 65; j <= 90; j++)
            {
                newRows[i]["5-Letter Word " + (char)j] = getRandomWord();
            }

            table.Rows.Add(newRows[i]);
        }

    }

    private static void MakeTable4()
    {
        DataTable table = new DataTable("Table 2");

        DataColumn column;

        column = new DataColumn();
        column.DataType = System.Type.GetType("System.Int32");
        column.ColumnName = "id";
        column.ReadOnly = true;
        column.Unique = true;

        table.Columns.Add(column);


        for (int i = 65; i <= 90; i++)
        {
            column = new DataColumn();
            column.DataType = System.Type.GetType("System.String");
            column.ColumnName = "5-Letter Word " + (char)i;
            column.AutoIncrement = false;
            column.Caption = "Random Word " + (char)i;
            column.ReadOnly = false;
            column.Unique = false;
            // Add the column to the table.
            table.Columns.Add(column);
        }

        DataColumn[] PrimaryKeyColumns = new DataColumn[1];
        PrimaryKeyColumns[0] = table.Columns["id"];
        table.PrimaryKey = PrimaryKeyColumns;

        // Instantiate the DataSet variable.
        dataSet4 = new DataSet();
        // Add the new DataTable to the DataSet.
        dataSet4.Tables.Add(table);

        // Create NumOfRows rows directly from Object[] value arrays and add
        // them to the DataTable
        for (int i = 0; i < NumOfRows; i++)
        {

            table.Rows.Add( 

                new Object[] {

                    i,

                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord(),
                    getRandomWord()

                } 

            );
        }

    }



    private static string getRandomWord()
    {
        // Note: Random.Next's upper bound is exclusive, so use 91 to include 'Z' (ASCII 90).
        char c0 = (char)rand.Next(65, 91);
        char c1 = (char)rand.Next(65, 91);
        char c2 = (char)rand.Next(65, 91);
        char c3 = (char)rand.Next(65, 91);
        char c4 = (char)rand.Next(65, 91);

        return "" + c0 + c1 + c2 + c3 + c4;
    }

    private static void printTable()
    {
        foreach (DataRow row in dataSet.Tables[0].Rows)
        {
            Console.WriteLine( row["id"] + "--" + row["5-Letter Word A"] + " - " + row["5-Letter Word Z"] );
        }
    }


}

I haven't really looked at your parallelism yet, but there are a couple of things.

First, change "ParsedFiles++;" to "Interlocked.Increment(ref ParsedFiles);", or protect it with a lock; as written, the increment is not atomic across the parallel workers.
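
For example (a minimal sketch; it assumes ParsedFiles is an int field, since properties cannot be passed by ref):

using System.Threading;

// Inside the Parallel.ForEach body, replacing ParsedFiles++:
Interlocked.Increment(ref ParsedFiles);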

Secondly, instead of the complicated event-driven parallelism, I would recommend using a Pipeline Pattern, which is quite suited to this.

Use concurrent queues (or blocking collections) from System.Collections.Concurrent to hold the work items between stages.

The first stage will hold the list of files to process.

Worker tasks will dequeue a file from that work list, parse it, then add the results to the second stage.

In the second stage, a worker task will take the items from the second-stage queue (the just-completed blocks of DataTable rows) and upload them to the database the moment they are ready.


Edit:

I wrote a Pipelined version of code which should help you on your way:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using System.IO;
using System.Data;

namespace dataTableTesting2
{
    class Program
    {          
        private const int BufferSize = 20; //Each buffer can only contain this many elements at a time.
                                           //This limits the total amount of memory in use.

        private const int MaxBlockSize = 100;

        private static BlockingCollection<string> buffer1 = new BlockingCollection<string>(BufferSize);

        private static BlockingCollection<string[]> buffer2 = new BlockingCollection<string[]>(BufferSize);

        private static BlockingCollection<Object[][]> buffer3 = new BlockingCollection<Object[][]>(BufferSize);

        /// <summary>
        /// Start Pipelines and wait for them to finish.
        /// </summary>
        static void Main(string[] args)
        {
            TaskFactory f = new TaskFactory(TaskCreationOptions.LongRunning, TaskContinuationOptions.None);

            Task stage0 = f.StartNew(() => PopulateFilesList(buffer1));
            Task stage1 = f.StartNew(() => ReadFiles(buffer1, buffer2));
            Task stage2 = f.StartNew(() => ParseStringBlocks(buffer2, buffer3));
            Task stage3 = f.StartNew(() => UploadBlocks(buffer3) );

            Task.WaitAll(stage0, stage1, stage2, stage3);

            /*
            // Note for more workers on particular stages you can make more tasks for each stage, like the following
            //    which populates the file list in 1 task, reads the files into string[] blocks in 1 task,
            //    then parses the string[] blocks in 4 concurrent tasks
            //    and lastly uploads the info in 2 tasks

            TaskFactory f = new TaskFactory(TaskCreationOptions.LongRunning, TaskContinuationOptions.None);

            Task stage0 = f.StartNew(() => PopulateFilesList(buffer1));
            Task stage1 = f.StartNew(() => ReadFiles(buffer1, buffer2));

            Task stage2a = f.StartNew(() => ParseStringBlocks(buffer2, buffer3));
            Task stage2b = f.StartNew(() => ParseStringBlocks(buffer2, buffer3));
            Task stage2c = f.StartNew(() => ParseStringBlocks(buffer2, buffer3));
            Task stage2d = f.StartNew(() => ParseStringBlocks(buffer2, buffer3));

            Task stage3a = f.StartNew(() => UploadBlocks(buffer3) );
            Task stage3b = f.StartNew(() => UploadBlocks(buffer3) );

            Task.WaitAll(stage0, stage1, stage2a, stage2b, stage2c, stage2d, stage3a, stage3b);

            */
        }

        /// <summary>
        /// Adds the filenames to process into the first pipeline
        /// </summary>
        /// <param name="output"></param>
        private static void PopulateFilesList( BlockingCollection<string> output )
        {
            try
            {
                buffer1.Add("file1.txt");
                buffer1.Add("file2.txt");
                //...
                buffer1.Add("lastFile.txt");
            }
            finally
            {
                output.CompleteAdding();
            }
        }

        /// <summary>
        /// Takes filenames out of the first pipeline, reads them into string[] blocks, and puts them in the second pipeline
        /// </summary>
        private static void ReadFiles( BlockingCollection<string> input, BlockingCollection<string[]> output)
        {
            try
            {
                foreach (string file in input.GetConsumingEnumerable())
                {
                    List<string> list = new List<string>(MaxBlockSize);

                    using (StreamReader sr = new StreamReader(file))
                    {
                        int countLines = 0;

                        while (!sr.EndOfStream)
                        {
                            list.Add( sr.ReadLine() );
                            countLines++;

                            if (countLines >= MaxBlockSize) // emit blocks of exactly MaxBlockSize lines
                            {
                                output.Add(list.ToArray());
                                countLines = 0;
                                list = new List<string>(MaxBlockSize);
                            }
                        }

                        if (list.Count > 0)
                        {
                            output.Add(list.ToArray());
                        }
                    }
                }
            }

            finally
            {
                output.CompleteAdding();
            }
        }

        /// <summary>
        /// Takes string[] blocks from the second pipeline, for each line, splits them by tabs, and parses
        /// the data, storing each line as an object array into the third pipeline.
        /// </summary>
        private static void ParseStringBlocks( BlockingCollection<string[]> input, BlockingCollection< Object[][] > output)
        {
            try
            {
                foreach (string[] block in input.GetConsumingEnumerable())
                {
                    // Build a fresh result list per block so earlier blocks are not re-emitted.
                    List<Object[]> result = new List<object[]>(MaxBlockSize);

                    foreach (string line in block)
                    {
                        string[] splitLine = line.Split('\t'); //split line on tab

                        string cityName = splitLine[0];
                        int cityPop = Int32.Parse( splitLine[1] );
                        int cityElevation = Int32.Parse(splitLine[2]);
                        //...

                        result.Add(new Object[] { cityName, cityPop, cityElevation });
                    }

                    output.Add( result.ToArray() );
                }
            }

            finally
            {
                output.CompleteAdding();
            }
        }

        /// <summary>
        /// Takes the data blocks from the third pipeline, and uploads each row to SQL Database
        /// </summary>
        private static void UploadBlocks(BlockingCollection<Object[][]> input)
        {
            /*
             * At this point 'block' is an array of object arrays.
             * 
             * The block contains MaxBlockSize number of cities.
             * 
             * There is one object array for each city.
             * 
             * The object array for the city is in the pre-defined order from pipeline stage2
             * 
             * You could do a couple of things at this point:
             * 
             * 1. declare and initialize a DataTable with the correct column types
             *    then, do the  dataTable.Rows.Add( rowValues )
             *    then, use a Bulk Copy Operation to upload the dataTable to SQL
             *    http://msdn.microsoft.com/en-us/library/7ek5da1a
             * 
             * 2. Manually perform the sql commands/transactions similar to what 
             *    Kevin recommends in this suggestion:
             *    http://stackoverflow.com/questions/1024123/sql-insert-one-row-or-multiple-rows-data/1024195#1024195
             * 
             * I've demonstrated the first approach with this code.
             * 
             * */


            DataTable dataTable = new DataTable();

            //set up columns of dataTable here.
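
            // A hypothetical column layout matching the Object[] rows built in
            // ParseStringBlocks (city name, population, elevation):
            dataTable.Columns.Add("Name", typeof(string));
            dataTable.Columns.Add("Population", typeof(int));
            dataTable.Columns.Add("Elevation", typeof(int));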

            foreach (Object[][] block in input.GetConsumingEnumerable())
            {
                foreach (Object[] rowValues in block)
                {

                    dataTable.Rows.Add(rowValues);
                }

                //do bulkCopy to upload table containing MaxBlockSize number of cities right here.
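                // For example (a sketch; assumes a connection string and destination table name):
                // using (var bulkCopy = new SqlBulkCopy(connectionString))
                // {
                //     bulkCopy.DestinationTableName = "Cities";
                //     bulkCopy.WriteToServer(dataTable);
                // }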

                dataTable.Rows.Clear(); //Remove the rows when you are done uploading, but not the dataTable.
            }
        }

    }
}

It breaks the work up into 4 parts which can be done by different tasks:

  1. make a list of files to process

  2. take files from that list and read them into string[]'s

  3. take the string[]'s from the previous part and parse them, creating object[]'s that contain the values for each row of the table

  4. upload the rows to database

It is also easy to assign more than 1 task to each phase, allowing multiple workers to execute the same pipeline stage if desired.

(I doubt that having more than 1 task reading files would be useful unless you are using a solid state drive, since jumping around on disk is quite slow.)

Also, you can set a limit to the amount of data in memory through the execution of the program.

Each buffer is a BlockingCollection initialized with a max size, which means that if the buffer is full and another task tries to add another element, that task will block.

Because the tasks are created with TaskCreationOptions.LongRunning, each stage gets its own dedicated thread, so a blocked stage simply waits on its collection until space (or data) becomes available, without starving the thread pool.

At present each buffer can hold only 20 items, and each block holds at most 100 lines, meaning that:

  • buffer1 will contain up to 20 filenames at any time.

  • buffer2 will contain up to 20 blocks of strings (composed of 100 lines) from those files, at any time.

  • buffer3 will contain up to 20 items of blocks of data (the object values for 100 cities) at any time.

So this would take only enough memory to hold 20 filenames, 2,000 lines from the files, and the data for 2,000 cities (plus a bit extra for local variables and such).

You will likely want to increase BufferSize and MaxBlockSize for efficiency, although as is, this should work.

Note, I haven't tested this, as I didn't have any input files, so there could be some bugs.

Whilst I agree with some of the other comments and answers, have you tried calling:

cityTable.BeginLoadData()

before the first item is added to the city table, and then calling:

cityTable.EndLoadData()

in the FileParsed event handler? BeginLoadData suspends notifications, index maintenance, and constraint checking while rows are being loaded, which can speed up bulk row insertion.
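
A minimal sketch of how that could fit the existing handlers (my arrangement, not from the original post):

cityTable.BeginLoadData(); // suspend notifications, indexes and constraints while rows accumulate

parser.FileParsed +=
    (s, e) =>
    {
        cityTable.EndLoadData();           // re-enable before the bulk copy
        bulkCopy.WriteToServer(cityTable);
        cityTable.Rows.Clear();
        cityTable.BeginLoadData();         // suspend again for the next file
    };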

If you're looking for raw performance, wouldn't something like this be the best option? It completely bypasses the DataTable code, which seems to be an unnecessary step.

void BulkInsertFile(string fileName, string tableName)
{
    FileInfo info = new FileInfo(fileName);
    string name = info.Name;
    string shareDirectory = "";  //the path of the share: \\servername\shareName\
    string serverDirectory = ""; //the local path of the share on the server: C:\shareName\

    File.Copy(fileName, shareDirectory + name);
    // or you could call your method to parse the file and write it to the share directory.

    using (SqlConnection cnn = new SqlConnection("connectionString"))
    {
        cnn.Open();
        using (SqlCommand cmd = cnn.CreateCommand())
        {
            // The rowterminator must reach SQL Server as the two characters \n,
            // so the backslash is escaped in the C# string.
            cmd.CommandText = string.Format("bulk insert {0} from '{1}' with (fieldterminator = ',', rowterminator = '\\n')", tableName, serverDirectory + name);

            try
            {
                cmd.ExecuteNonQuery();
            }
            catch (SqlException ex)
            {
                MessageBox.Show(ex.Message);
            }
        }
    }
}

More information on the BULK INSERT command is available in the SQL Server documentation.
