C# parallel writes to an Azure Data Lake file

Question

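In our Azure Data Lake we have daily files that record events along with the coordinates of those events. We need to take those coordinates and look up the state, county, township, and section they fall in. I have tried several versions of this code.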

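I tried to do this in U-SQL. I even uploaded a custom assembly that implemented the Microsoft.SqlServer.Types.SqlGeography methods, only to find that ADLA is not set up to perform row-by-row operations like geocoding.
I pulled all the rows into SQL Server, converted the coordinates to SqlGeography, and built T-SQL code to perform the state, county, etc. lookups. After a lot of optimization I got the process down to about 700 ms per row. (With 13 million rows in the backlog and about 16,000 rows added per day, it would take us nearly 3 years to catch up.) So I parallelized the T-SQL; things got better, but not enough.
I took the T-SQL code and rebuilt the process as a console application, since the SqlGeography library is actually a .NET library and not a native SQL Server product. I was able to get single-threaded processing down to about 500 ms. Adding .NET parallelism (Parallel.ForEach) and throwing 10 of my machine's 20 cores at it helped a lot, but still not enough.
I tried to rewrite this code as an Azure Function that processes the files in the data lake one by one. Most of the files timed out because they took more than 10 minutes to process. So I updated the code to read in the files and put the rows into Azure Queue storage. Then I have a second Azure Function that fires for each row in the queue. The idea is that Azure Functions can scale out far beyond what any single machine can do.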

This is where I am stuck. I can't reliably write rows to a file in ADLS. Here is the code I have now.

public static void WriteGeocodedOutput(string Contents, String outputFileName, ILogger log) {

        AdlsClient client = AdlsClient.CreateClient(ADlSAccountName, adlCreds);
        //if the file doesn't exist write the header first
        try {
            if (!client.CheckExists(outputFileName)) {
                using (var stream = client.CreateFile(outputFileName, IfExists.Fail)) {
                    byte[] headerByteArray = Encoding.UTF8.GetBytes("EventDate, Longitude, Latitude, RadarSiteID, CellID, RangeNauticalMiles, Azimuth, SevereProbability, Probability, MaxSizeinInchesInUS, StateCode, CountyCode, TownshipCode, RangeCode\r\n");
                    //stream.Write(headerByteArray, 0, headerByteArray.Length);
                    client.ConcurrentAppend(outputFileName, true, headerByteArray, 0, headerByteArray.Length);
                }
            }
        } catch (Exception e) {
            log.LogInformation("multiple attempts to create the file. Ignoring this error, since the file was created.");
        }

        //then write the data
        byte[] textByteArray = Encoding.UTF8.GetBytes(Contents);
        for (int attempt = 0; attempt < 5; attempt++) {
            try {
                log.LogInformation("prior to write, the outputfile size is: " + client.GetDirectoryEntry(outputFileName).Length);
                var offset = client.GetDirectoryEntry(outputFileName).Length;
                client.ConcurrentAppend(outputFileName, false, textByteArray, 0, textByteArray.Length);
                log.LogInformation("AFTER write, the outputfile size is: " + client.GetDirectoryEntry(outputFileName).Length);
                //if successful, stop trying to write this row
                attempt = 6;                    
            }
            catch (Exception e){
                log.LogInformation($"exception on adls write: {e}");
            }
            Random rnd = new Random();
            Thread.Sleep(rnd.Next(attempt * 60));
        }
    }

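The file gets created when it needs to be, but I do get several messages in my log showing that multiple threads tried to create it. The header row is not always written.

I also no longer get any data rows, only this error: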

"BadRequest ( IllegalArgumentException  concurrentappend failed with error 0xffffffff83090a6f 
(Bad request. The target file does not support this particular type of append operation. 
If the concurrent append operation has been used with this file in the past, you need to append to this file using the concurrent append operation.
If the append operation with offset has been used in the past, you need to append to this file using the append operation with offset. 
On the same file, it is not possible to use both of these operations.). []

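I feel like I am missing some basic design idea here. The code should try to write a row to the file. If the file doesn't exist yet, create it and put in the header row. Then put in the row.

What is the best-practice way to accomplish this kind of write scenario?

Any other suggestions on how to handle this kind of parallel-write workload in ADLS?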

Answer 1

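I'm a bit late to this, but I guess one of the problems could be the use of "Create" and "ConcurrentAppend" on the same file. The ADLS documentation mentions that they can't be used on the same file. Maybe try changing the "Create" command to "ConcurrentAppend", since the latter can be used to create the file if it doesn't exist.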
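
A minimal sketch of that idea, assuming the same Microsoft.Azure.DataLake.Store client as in the question. The class name GeocodedOutputWriter, the method name AppendRow, the retry count, and the delay values are illustrative choices and do not come from the original code. The sketch never calls CreateFile; it relies on ConcurrentAppend with autoCreate set to true, so every write to the file goes through the same kind of append. It also leaves the header out: as far as I understand, concurrent appends do not promise any ordering between writers, so a header written this way may not end up on the first line.

```csharp
using System;
using System.Text;
using System.Threading;
using Microsoft.Azure.DataLake.Store;
using Microsoft.Extensions.Logging;

// Hypothetical helper, not part of the question's code.
public static class GeocodedOutputWriter
{
    // Random is not thread-safe, so access to it is guarded by a lock.
    private static readonly Random Jitter = new Random();

    // Appends one row using only ConcurrentAppend. Returns true on success,
    // false if every attempt failed.
    public static bool AppendRow(AdlsClient client, string outputFileName,
                                 string row, ILogger log, int maxAttempts = 5)
    {
        // Make sure each row ends with a line break so rows do not run together.
        string line = row.EndsWith("\n", StringComparison.Ordinal) ? row : row + "\r\n";
        byte[] bytes = Encoding.UTF8.GetBytes(line);

        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                // autoCreate: true lets the first writer create the file,
                // so no separate CheckExists/CreateFile step is needed.
                client.ConcurrentAppend(outputFileName, true, bytes, 0, bytes.Length);
                return true;
            }
            catch (AdlsException e)
            {
                log.LogWarning(e, "ADLS append attempt {Attempt} failed", attempt);
            }

            if (attempt < maxAttempts)
            {
                // Randomized, growing delay before the next attempt (values are arbitrary).
                int delayMs;
                lock (Jitter)
                {
                    delayMs = Jitter.Next(50, 50 + 100 * attempt);
                }
                Thread.Sleep(delayMs);
            }
        }
        return false;
    }
}
```

A file that was already created with CreateFile would presumably keep failing with the same error, so this would need a new output file name (or the old file deleted first). If a header is required, one option is to add it in a later step that reads the finished file, rather than from the parallel writers.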

Also, if you found a better way to do it, please post your solution here.

Solution 1
0 2019-01-28 21:08:23
