
Transferring huge amounts of data from SQL Server into Parquet files

Recently I have been challenged with the task of creating a process that extracts data from a SQL Server database and writes it to Parquet files. I searched online and found various examples that load the data into a DataTable and then write it via ParquetWriter into Parquet files.

Here is an excerpt from the code I'm currently testing:

        static void TestWriteParquet(string ConnectionString, string Query, string OutputFilePath, int rowGroupSize = 10000)
        {
            DataTable dt = GetData(ConnectionString, Query);
            var fields = GenerateSchema(dt);

            using (var stream = File.Open(OutputFilePath, FileMode.Create, FileAccess.Write))
            {
                using (var writer = new ParquetWriter(new Schema(fields), stream))
                {
                    var startRow = 0;

                    // Keep on creating row groups until we run out of data
                    while (startRow < dt.Rows.Count)
                    {
                        using (var rgw = writer.CreateRowGroup())
                        {
                            // Data is written to the row group column by column
                            for (var i = 0; i < dt.Columns.Count; i++)
                            {
                                var columnIndex = i;

                                // Determine the target data type for the column
                                var targetType = dt.Columns[columnIndex].DataType;
                                if (targetType == typeof(DateTime)) targetType = typeof(DateTimeOffset);

                                // Generate the value type, this is to ensure it can handle null values
                                var valueType = targetType.IsClass
                                    ? targetType
                                    : typeof(Nullable<>).MakeGenericType(targetType);

                                // Create a list to hold values of the required type for the column
                                var list = (IList)typeof(List<>)
                                    .MakeGenericType(valueType)
                                    .GetConstructor(Type.EmptyTypes)
                                    .Invoke(null);

                                // Get the data to be written to the parquet stream
                                foreach (var row in dt.AsEnumerable().Skip(startRow).Take(rowGroupSize))
                                {
                                    // Check if value is null, if so then add a null value
                                    if (row[columnIndex] == null || row[columnIndex] == DBNull.Value)
                                    {
                                        list.Add(null);
                                    }
                                    else
                                    {
                                        // Add the value to the list, but if it’s a DateTime then create it as a DateTimeOffset first
                                        list.Add(dt.Columns[columnIndex].DataType == typeof(DateTime)
                                            ? new DateTimeOffset((DateTime)row[columnIndex])
                                            : row[columnIndex]);
                                    }
                                }

                                // Copy the list values into an array of the required type,
                                // since the WriteColumn method expects an Array
                                var valuesArray = Array.CreateInstance(valueType, list.Count);
                                list.CopyTo(valuesArray, 0);

                                // Write the column
                                rgw.WriteColumn(new Parquet.Data.DataColumn(fields[i], valuesArray));
                            }
                        }

                        startRow += rowGroupSize;
                    }
                }
            }
        }

Given that we are dealing with enormous tables, which will have to be split into several files, I wonder if there is a way to stream the data instead of loading it into a DataTable first. What would be an alternative to this approach?
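For what it's worth, one way to avoid materializing the whole result set in a DataTable is to read from a SqlDataReader, buffer values per column, and flush a row group every N rows. This is only a sketch against the Parquet.Net v3 API used in the code above; the `StreamToParquet` and `Flush` names are mine, and the nullable-type and DateTimeOffset handling mirrors the original code:

```csharp
using System;
using System.Collections;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.IO;
using Parquet;
using Parquet.Data;

static class StreamingExport
{
    public static void StreamToParquet(string connectionString, string query,
                                       string outputFilePath, int rowGroupSize = 10000)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(query, conn))
        {
            conn.Open();
            using (var reader = cmd.ExecuteReader())
            using (var stream = File.Open(outputFilePath, FileMode.Create, FileAccess.Write))
            {
                int fieldCount = reader.FieldCount;

                // Derive schema and per-column buffer types from the reader's metadata
                var fields = new DataField[fieldCount];
                var valueTypes = new Type[fieldCount];
                var buffers = new IList[fieldCount];
                for (int i = 0; i < fieldCount; i++)
                {
                    Type t = reader.GetFieldType(i);
                    if (t == typeof(DateTime)) t = typeof(DateTimeOffset);
                    valueTypes[i] = t.IsClass ? t : typeof(Nullable<>).MakeGenericType(t);
                    fields[i] = new DataField(reader.GetName(i), valueTypes[i]);
                    buffers[i] = (IList)Activator.CreateInstance(
                        typeof(List<>).MakeGenericType(valueTypes[i]));
                }

                using (var writer = new ParquetWriter(new Schema(fields), stream))
                {
                    int buffered = 0;
                    while (reader.Read())
                    {
                        for (int i = 0; i < fieldCount; i++)
                        {
                            object v = reader.IsDBNull(i) ? null : reader.GetValue(i);
                            if (v is DateTime dtv) v = new DateTimeOffset(dtv);
                            buffers[i].Add(v);
                        }
                        // Flush a row group once enough rows are buffered
                        if (++buffered == rowGroupSize)
                        {
                            Flush(writer, fields, valueTypes, buffers);
                            buffered = 0;
                        }
                    }
                    // Flush any remaining partial row group
                    if (buffered > 0) Flush(writer, fields, valueTypes, buffers);
                }
            }
        }
    }

    static void Flush(ParquetWriter writer, DataField[] fields, Type[] valueTypes, IList[] buffers)
    {
        using (var rgw = writer.CreateRowGroup())
        {
            for (int i = 0; i < fields.Length; i++)
            {
                var arr = Array.CreateInstance(valueTypes[i], buffers[i].Count);
                buffers[i].CopyTo(arr, 0);
                rgw.WriteColumn(new DataColumn(fields[i], arr));
                buffers[i].Clear();
            }
        }
    }
}
```

This keeps at most one row group's worth of data in memory at a time; splitting into multiple output files would then just be a matter of opening a new stream and writer every K row groups.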

Furthermore, it would be good to know whether the compression ratio within the Parquet file depends only on the row group size, or whether there are other ways to increase it.
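Beyond the row group size, Parquet.Net also lets you pick the compression codec on the writer itself. A minimal sketch, assuming the same ParquetWriter as above (Gzip typically compresses better than the default Snappy, at the cost of CPU):

```csharp
using (var writer = new ParquetWriter(new Schema(fields), stream))
{
    // Codec is set once on the writer and applies to all row groups
    writer.CompressionMethod = CompressionMethod.Gzip;

    // ... create row groups and write columns as before
}
```

Sorting the data so that similar values end up adjacent within a column can also noticeably improve compression, since Parquet encodes and compresses column by column.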

I tested the process with a table of somewhere around 350k rows and it worked, but it was quite slow and also consumed quite a lot of memory. Considering that our largest table holds something like 200 billion rows and is at least 60 columns wide, I doubt this approach can cope.

Using Cinchoo ETL, an open source library, you can create a Parquet file from a database as below:

using (var conn = new SqlConnection(@"*** CONNECTION STRING ***"))
{
    conn.Open();
    var cmd = new SqlCommand("SELECT * FROM TABLE", conn);

    var dr = cmd.ExecuteReader();

    using (var w = new ChoParquetWriter(@"*** PARQUET FILE PATH ***")
        .Configure(c => c.LiteParsing = true)
        .Configure(c => c.RowGroupSize = 5000)
        .NotifyAfter(100000)
        .OnRowsWritten((o, e) => $"Rows Loaded: {e.RowsWritten} <-- {DateTime.Now}".Print())
        )
    {
        w.Write(dr);
    }
}

For more information, please check the article https://www.codeproject.com/Articles/5271468/Cinchoo-ETL-Parquet-Writer .

Sample fiddle: https://dotnetfiddle.net/Ra8yf4

Disclaimer: I'm the author of this library.
