
Transferring huge amounts of data from SQL Server into parquet files

Recently I have been challenged with the task of creating a process that extracts data from a SQL Server DB and writes it to parquet files. I searched online and found various examples that load the data into a DataTable and then write it to parquet files via ParquetWriter.

Here is an excerpt from the code I'm currently testing:

        static void TestWriteParquet(string ConnectionString, string Query, string OutputFilePath, int rowGroupSize = 10000)
        {
            DataTable dt = GetData(ConnectionString, Query);
            var fields = GenerateSchema(dt);

            using (var stream = File.Open(OutputFilePath, FileMode.Create, FileAccess.Write))
            {
                using (var writer = new ParquetWriter(new Schema(fields), stream))
                {
                    var startRow = 0;

                    // Keep on creating row groups until we run out of data
                    while (startRow < dt.Rows.Count)
                    {
                        using (var rgw = writer.CreateRowGroup())
                        {
                            // Data is written to the row group column by column
                            for (var i = 0; i < dt.Columns.Count; i++)
                            {
                                var columnIndex = i;

                                // Determine the target data type for the column
                                var targetType = dt.Columns[columnIndex].DataType;
                                if (targetType == typeof(DateTime)) targetType = typeof(DateTimeOffset);

                                // Generate the value type, this is to ensure it can handle null values
                                var valueType = targetType.IsClass
                                    ? targetType
                                    : typeof(Nullable<>).MakeGenericType(targetType);

                                // Create a list to hold values of the required type for the column
                                var list = (IList)typeof(List<>)
                                    .MakeGenericType(valueType)
                                    .GetConstructor(Type.EmptyTypes)
                                    .Invoke(null);

                                // Get the data to be written to the parquet stream
                                foreach (var row in dt.AsEnumerable().Skip(startRow).Take(rowGroupSize))
                                {
                                    // Check if value is null, if so then add a null value
                                    if (row[columnIndex] == null || row[columnIndex] == DBNull.Value)
                                    {
                                        list.Add(null);
                                    }
                                    else
                                    {
                                        // Add the value to the list, but if it’s a DateTime then create it as a DateTimeOffset first
                                        list.Add(dt.Columns[columnIndex].DataType == typeof(DateTime)
                                            ? new DateTimeOffset((DateTime)row[columnIndex])
                                            : row[columnIndex]);
                                    }
                                }

                                // Copy the list values to an array of the same type,
                                // as the WriteColumn method expects an Array
                                var valuesArray = Array.CreateInstance(valueType, list.Count);
                                list.CopyTo(valuesArray, 0);

                                // Write the column
                                rgw.WriteColumn(new Parquet.Data.DataColumn(fields[i], valuesArray));
                            }
                        }

                        startRow += rowGroupSize;
                    }
                }
            }
        }

Given that we are dealing with enormous tables, which will have to be split across several files, I wonder whether there is a way to stream the data instead of loading it into a DataTable first. What would be an alternative to this approach?
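
For illustration, here is a rough, untested sketch of the kind of streaming approach I have in mind: read directly from a SqlDataReader and buffer only one row group per column at a time, so memory usage is bounded by rowGroupSize instead of by the whole table. GenerateSchemaFromReader is a hypothetical helper that would mirror the existing GenerateSchema(dt), only built from reader.GetName(i) / reader.GetFieldType(i); the ParquetWriter / CreateRowGroup / WriteColumn calls are the same Parquet.Net API used above.

static void StreamToParquet(string connectionString, string query, string outputFilePath, int rowGroupSize = 10000)
{
    // Requires: System, System.Collections, System.Collections.Generic,
    // System.Data.SqlClient, System.IO, Parquet, Parquet.Data
    using (var conn = new SqlConnection(connectionString))
    using (var cmd = new SqlCommand(query, conn))
    {
        conn.Open();

        using (var reader = cmd.ExecuteReader())
        using (var stream = File.Open(outputFilePath, FileMode.Create, FileAccess.Write))
        {
            var fieldCount = reader.FieldCount;
            var fields = GenerateSchemaFromReader(reader);   // hypothetical helper, see note above

            // Per-column value types, using the same mapping as the DataTable version
            // (DateTime -> DateTimeOffset, value types wrapped in Nullable<>)
            var valueTypes = new Type[fieldCount];
            var isDateTime = new bool[fieldCount];
            for (var i = 0; i < fieldCount; i++)
            {
                var t = reader.GetFieldType(i);
                isDateTime[i] = t == typeof(DateTime);
                if (isDateTime[i]) t = typeof(DateTimeOffset);
                valueTypes[i] = t.IsClass ? t : typeof(Nullable<>).MakeGenericType(t);
            }

            // One typed buffer per column, created via reflection exactly like above,
            // never holding more than rowGroupSize rows at a time
            var lists = new IList[fieldCount];
            for (var i = 0; i < fieldCount; i++)
                lists[i] = (IList)typeof(List<>).MakeGenericType(valueTypes[i])
                    .GetConstructor(Type.EmptyTypes).Invoke(null);

            using (var writer = new ParquetWriter(new Schema(fields), stream))
            {
                var buffered = 0;

                void Flush()
                {
                    using (var rgw = writer.CreateRowGroup())
                    {
                        for (var i = 0; i < fieldCount; i++)
                        {
                            // Same typed-array conversion as in the DataTable version
                            var values = Array.CreateInstance(valueTypes[i], lists[i].Count);
                            lists[i].CopyTo(values, 0);
                            rgw.WriteColumn(new Parquet.Data.DataColumn(fields[i], values));
                            lists[i].Clear();
                        }
                    }
                    buffered = 0;
                }

                while (reader.Read())
                {
                    for (var i = 0; i < fieldCount; i++)
                    {
                        if (reader.IsDBNull(i))
                            lists[i].Add(null);
                        else
                            lists[i].Add(isDateTime[i]
                                ? new DateTimeOffset(reader.GetDateTime(i))
                                : reader.GetValue(i));
                    }

                    if (++buffered == rowGroupSize) Flush();
                }

                if (buffered > 0) Flush();   // write the final, partial row group
            }
        }
    }
}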

Furthermore, it would be good to know whether the compression ratio within the parquet file depends only on the row group size, or whether there are other ways to increase it.
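
From the Parquet.Net samples I have seen, I assume the compression codec can also be chosen on the writer itself, along these lines (untested; CompressionMethod is the property name I have seen in Parquet.Net examples):

using (var writer = new ParquetWriter(new Schema(fields), stream))
{
    // Assumption: Gzip generally produces a smaller file than the default codec,
    // at the cost of some extra CPU time per row group
    writer.CompressionMethod = CompressionMethod.Gzip;

    // ... write row groups as above ...
}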

I tested the process with a table of somewhere around 350k rows and it worked, but it was quite slow and consumed quite a lot of memory. Considering that our largest table holds something like 200 billion rows and is at least 60 columns wide, I doubt this approach will scale.

Using Cinchoo ETL, an open-source library, you can create a parquet file from a database as below:

using (var conn = new SqlConnection(@"*** CONNECTION STRING ***"))
{
    conn.Open();
    var cmd = new SqlCommand("SELECT * FROM TABLE", conn);

    var dr = cmd.ExecuteReader();

    using (var w = new ChoParquetWriter(@"*** PARQUET FILE PATH ***")
        .Configure(c => c.LiteParsing = true)
        .Configure(c => c.RowGroupSize = 5000)
        .NotifyAfter(100000)
        .OnRowsWritten((o, e) => $"Rows Loaded: {e.RowsWritten} <-- {DateTime.Now}".Print())
        )
    {
        w.Write(dr);
    }
}

For more information, please check the https://www.codeproject.com/Articles/5271468/Cinchoo-ETL-Parquet-Writer article.

Sample fiddle: https://dotnetfiddle.net/Ra8yf4

Disclaimer: I'm the author of this library.
