C# parallel writes to an Azure Data Lake file

Question

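In our Azure Data Lake we have daily files that record events and the coordinates of those events. We need to take those coordinates and look up the state, county, township, and section each one falls in. I have tried several versions of this code.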

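I tried to do this in U-SQL. I even uploaded a custom assembly that implemented the Microsoft.SqlServer.Types.SqlGeography methods, only to find that ADLA is not set up to perform row-by-row operations such as geocoding.
I pulled all of the rows into SQL Server, converted the coordinates to SqlGeography, and built T-SQL code that performed the state, county, etc. lookups. After a lot of optimization I got the process down to about 700 ms per row. (With a backlog of 13 million rows, growing by about 16,000 rows per day, it would take us nearly 3 years to catch up.) So I parallelized the T-SQL, which improved things, but not enough.
I took the T-SQL code and rebuilt the process as a console application, since the SqlGeography library is actually a .NET library rather than a native SQL Server product. I was able to get single-threaded processing down to about 500 ms. Adding .NET parallelism (Parallel.ForEach) and throwing 10 of my machine's 20 cores at it helped a lot, but still not enough.
I tried rewriting this code as an Azure Function that processes the files in the data lake one at a time. Most of the files timed out because they took more than 10 minutes to process. So I updated the code to read each file and push its rows into Azure Queue storage. A second Azure Function is then triggered for every row in the queue. The idea is that Azure Functions can scale out far beyond what any single machine can do.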

This is where I am stuck. I cannot reliably write rows to a file in ADLS. Here is the code I have now.

public static void WriteGeocodedOutput(string Contents, String outputFileName, ILogger log) {

        AdlsClient client = AdlsClient.CreateClient(ADlSAccountName, adlCreds);
        //if the file doesn't exist write the header first
        try {
            if (!client.CheckExists(outputFileName)) {
                using (var stream = client.CreateFile(outputFileName, IfExists.Fail)) {
                    byte[] headerByteArray = Encoding.UTF8.GetBytes("EventDate, Longitude, Latitude, RadarSiteID, CellID, RangeNauticalMiles, Azimuth, SevereProbability, Probability, MaxSizeinInchesInUS, StateCode, CountyCode, TownshipCode, RangeCode\r\n");
                    //stream.Write(headerByteArray, 0, headerByteArray.Length);
                    client.ConcurrentAppend(outputFileName, true, headerByteArray, 0, headerByteArray.Length);
                }
            }
        } catch (Exception e) {
            log.LogInformation("multiple attempts to create the file. Ignoring this error, since the file was created.");
        }

        //then write the data
        byte[] textByteArray = Encoding.UTF8.GetBytes(Contents);
        for (int attempt = 0; attempt < 5; attempt++) {
            try {
                log.LogInformation("prior to write, the outputfile size is: " + client.GetDirectoryEntry(outputFileName).Length);
                var offset = client.GetDirectoryEntry(outputFileName).Length;
                client.ConcurrentAppend(outputFileName, false, textByteArray, 0, textByteArray.Length);
                log.LogInformation("AFTER write, the outputfile size is: " + client.GetDirectoryEntry(outputFileName).Length);
                //if successful, stop trying to write this row
                attempt = 6;                    
            }
            catch (Exception e){
                log.LogInformation($"exception on adls write: {e}");
            }
            Random rnd = new Random();
            Thread.Sleep(rnd.Next(attempt * 60));
        }
    }

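The file gets created when needed, but I do see several messages in my logs showing that multiple threads tried to create it. The header row is not always written.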

I am also no longer getting any data rows written, only this error:

"BadRequest ( IllegalArgumentException  concurrentappend failed with error 0xffffffff83090a6f 
(Bad request. The target file does not support this particular type of append operation. 
If the concurrent append operation has been used with this file in the past, you need to append to this file using the concurrent append operation.
If the append operation with offset has been used in the past, you need to append to this file using the append operation with offset. 
On the same file, it is not possible to use both of these operations.). []

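I feel like I am missing some fundamental design idea here. The code should try to write a row to the file. If the file does not exist yet, it should create the file and put the header row in, then put the row in.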

What is the best-practice way to handle this kind of write scenario?

Are there any other suggestions for handling this kind of parallel write workload in ADLS?

Answer 1

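I am a bit late to this, but I would guess one of the problems is that both "Create" and "ConcurrentAppend" are used on the same file. The ADLS documentation mentions that they cannot be used on the same file. Maybe try replacing the "Create" call with "ConcurrentAppend", since the latter can create the file if it does not exist.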
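A minimal sketch of that suggestion, assuming the Microsoft.Azure.DataLake.Store package used in the question. The class name GeocodedOutputWriter, the method TryAppendRow, and the retry limit are hypothetical and not from the original post. It never calls CreateFile and relies on the autoCreate flag of ConcurrentAppend instead. Going by the error message quoted in the question, a file that was already created through CreateFile would probably still reject the call, so this assumes a fresh output file.

```csharp
using System;
using System.Text;
using System.Threading;
using Microsoft.Azure.DataLake.Store;

public static class GeocodedOutputWriter
{
    // One shared Random for backoff jitter, guarded by a lock because
    // Random is not thread-safe.
    private static readonly Random Jitter = new Random();
    private static readonly object JitterLock = new object();

    // Hypothetical helper: appends one row using ConcurrentAppend only,
    // so the file is never touched by CreateFile or an offset-based append.
    // Returns true once the append succeeds, false after maxAttempts failures.
    public static bool TryAppendRow(AdlsClient client, string outputFileName,
                                    string row, int maxAttempts = 5)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(row);

        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            try
            {
                // autoCreate: true asks the service to create the file
                // if it does not exist yet.
                client.ConcurrentAppend(outputFileName, true, bytes, 0, bytes.Length);
                return true;
            }
            catch (AdlsException)
            {
                if (attempt == maxAttempts)
                {
                    break;
                }

                // Short randomized pause before the next attempt.
                int delayMs;
                lock (JitterLock)
                {
                    delayMs = Jitter.Next(1, attempt * 60 + 1);
                }
                Thread.Sleep(delayMs);
            }
        }

        return false;
    }
}
```

With this shape there is no separate "create the file and write the header" step for callers to race on. The trade-off is that the service, not the caller, decides where each concurrent append lands, so a header row cannot be relied on to come first. One option (an assumption, not something the original post confirms) is to leave the header out of the shared file and add it in a later step.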

Also, if you have found a better approach, please post your solution here.

Solution 1
0 2019-01-28 21:08:23