Multithreading tasks to process files in C#

Question

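I have read a lot about threading, but I can't work out how to apply it to my problem. First, let me describe the problem. I have a number of files that need to be processed. The host names and file paths are held in two arrays.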

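Now I want to set up several threads to process the files. The number of threads to create is based on three factors:
A) The maximum number of threads can never exceed the number of unique host names.
B) Files with the same host name must be processed sequentially, i.e. we cannot process host1_file1 and host1_file2 at the same time. (Data integrity would be at risk, and this is outside my control.)
C) The user may limit the number of threads available for processing. The thread count is still bounded by condition A above. This is purely because, with a large number of hosts, say 50, we might not want 50 threads running at once.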

In the example data, which has six unique host names, at most 6 threads can be created.
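Taken together, rules A and C reduce to the smaller of the user's limit and the number of distinct hosts. A minimal sketch, assuming a hypothetical helper named `ThreadBudget.Compute` (not part of the question's code) and assuming that a non-positive limit means the user set no cap:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ThreadBudget
{
    // Hypothetical helper: how many hosts may be processed concurrently.
    public static int Compute(IEnumerable<string> hostNames, int userLimit)
    {
        int uniqueHosts = hostNames.Distinct().Count();   // rule A: one thread per unique host at most
        if (userLimit <= 0) return uniqueHosts;           // assumption: non-positive means "no user cap"
        return Math.Min(userLimit, uniqueHosts);          // rule C, still bounded by rule A
    }
}
```

With the ten-entry host array from the question, a user limit of 50 yields 6 and a user limit of 4 yields 4.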

The optimal processing routine is shown below.

[Figure: optimal processing routine]

using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class file_prep_obj
{
    public string[] file_paths;
    public string[] hostname;
    public Dictionary<string, int> my_dictionary;

    public void get_files()
    {
        hostname = new string[]{ "host1", "host1", "host1", "host2", "host2", "host3", "host4","host4","host5","host6" };
        file_paths=new string[]{"C:\\host1_file1","C:\\host1_file2","C:\\host1_file3","C:\\host2_file1","C:\\host2_file2",
                                "C:\\host3_file1","C:\\host4_file1","C:\\host4_file2","C:\\host5_file1","C:\\host6_file1"};
        //The dictionary provides a count on the number of files that need to be processed for a particular host.
        my_dictionary = hostname.GroupBy(x => x)
                        .ToDictionary(g => g.Key,
                        g => g.Count());
    }
}

//This class contains a list of file_paths associated with the same host.
//The group_file_host_name will be the same for a host.
class host_file_thread
{
    public string[] group_file_paths;
    public string[] group_file_host_name;

    public void process_file(string file_path_in)
    {
        var time_delay_random=new Random();
        Console.WriteLine("Started processing File: " + file_path_in);
        Task.Delay(time_delay_random.Next(3000)+1000);
        Console.WriteLine("Completed processing File: " + file_path_in);
    }
}

class Program
{
    static void Main(string[] args)
    {
        file_prep_obj my_files=new file_prep_obj();
        my_files.get_files();
        //Create our host objects... my_files.my_dictionary.Count represents the max number of threads
        host_file_thread[] host_thread=new host_file_thread[my_files.my_dictionary.Count];

        int key_pair_count=0;
        int file_path_position=0;
        foreach (KeyValuePair<string, int> pair in my_files.my_dictionary)
        {
            host_thread[key_pair_count] = new host_file_thread();   //Initialise the host_file_thread object. Because we have an array of a customised object
            host_thread[key_pair_count].group_file_paths=new string[pair.Value];        //Initialise the group_file_paths
            host_thread[key_pair_count].group_file_host_name=new string[pair.Value];    //Initialise the group_file_host_name


            for(int j=0;j<pair.Value;j++)
            {
                host_thread[key_pair_count].group_file_host_name[j]=pair.Key.ToString();                        //Group the hosts
                host_thread[key_pair_count].group_file_paths[j]=my_files.file_paths[file_path_position];        //Group the file_paths
                file_path_position++;
            }
            key_pair_count++;
        }//Close foreach (KeyValuePair<string, int> pair in my_files.my_dictionary)

        //TODO PROCESS FILES USING host_thread objects. 
    }//Close static void Main(string[] args)
}//Close Class Program

I think what I need is guidance on how to write the thread-processing routine according to the specification above.

Answer 1

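You can use Stephen Toub's ForEachAsync extension method to process the files. It lets you specify how many operations run concurrently, and it is non-blocking, so your main thread is free to do other work. Here is the method from the article: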

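// Requires: using System; using System.Collections.Concurrent; using System.Collections.Generic;
//           using System.Linq; using System.Threading.Tasks;
// Extension methods must live in a static class; the class name here is arbitrary.
public static class EnumerableExtensions
{
    // Splits the source into dop partitions and processes each partition's items one at a time.
    public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
    {
        return Task.WhenAll(
            from partition in Partitioner.Create(source).GetPartitions(dop)
            select Task.Run(async delegate
            {
                using (partition)
                    while (partition.MoveNext())
                        await body(partition.Current);
            }));
    }
}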

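To use it, I refactored your code slightly. I changed the dictionary to a Dictionary<string, List<string>>, which holds the host as the key and all of that host's paths as the value. I am assuming that each file path contains its host name.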

   my_dictionary = (from h in hostname
                    from f in file_paths
                    where f.Contains(h)
                    select new { Hostname = h, File = f }).GroupBy(x => x.Hostname)
                    .ToDictionary(x => x.Key, x => x.Select(s => s.File).Distinct().ToList());
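The query above relies on every path containing its host name, and `Contains` would also match `host1` inside a name such as `host10`. If the two arrays are index-aligned instead (same length, with `hostname[i]` owning `file_paths[i]`), pairing them by position avoids that assumption. A sketch, where `HostGrouping.ByIndex` is a hypothetical helper name:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HostGrouping
{
    // Hypothetical helper: pairs hostNames[i] with filePaths[i], then groups the paths by host.
    public static Dictionary<string, List<string>> ByIndex(string[] hostNames, string[] filePaths)
    {
        if (hostNames.Length != filePaths.Length)
            throw new ArgumentException("Both arrays must have the same length.");

        return hostNames
            .Zip(filePaths, (host, path) => new { host, path })
            .GroupBy(x => x.host, x => x.path)
            .ToDictionary(g => g.Key, g => g.ToList());   // file order within a host is preserved
    }
}
```

The length check fails fast if the arrays drift out of step, rather than silently attaching a file to the wrong host.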

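I also changed your process_file method to be async: since you use Task.Delay inside it, you need to await it, otherwise the call returns immediately and no delay takes place.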

public static async Task process_file(string file_path_in)
{
    var time_delay_random = new Random();
    Console.WriteLine("Started:{0} ThreadId:{1}", file_path_in, Thread.CurrentThread.ManagedThreadId);
    await Task.Delay(time_delay_random.Next(3000) + 1000);
    Console.WriteLine("Completed:{0} ThreadId:{1}", file_path_in, Thread.CurrentThread.ManagedThreadId);
}
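The difference between calling `Task.Delay` and awaiting it can be measured directly. This is an illustrative sketch only; `DelayDemo` and its method names are made up for the comparison and are not part of either the question or the answer:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class DelayDemo
{
    // Starts a delay but never awaits it, so the method returns almost immediately.
    public static long NotAwaitedMs(int delayMs)
    {
        var sw = Stopwatch.StartNew();
        Task.Delay(delayMs);            // the returned task is created and then discarded
        return sw.ElapsedMilliseconds;
    }

    // Awaits the delay, so the code after the await runs only once the delay has elapsed.
    public static async Task<long> AwaitedMs(int delayMs)
    {
        var sw = Stopwatch.StartNew();
        await Task.Delay(delayMs);
        return sw.ElapsedMilliseconds;
    }
}
```

Calling `NotAwaitedMs(500)` should report a value close to zero, while `await AwaitedMs(500)` should report roughly the requested delay.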

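To use the code, work out the maximum number of threads you want and pass it to my_files.my_dictionary.ForEachAsync. You also supply an async delegate that processes every file for a particular host, awaiting each file in turn.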

public static async Task MainAsync()
{
    var my_files = new file_prep_obj();
    my_files.get_files();

    const int userSuppliedMaxThread = 5;
    var maxThreads = Math.Min(userSuppliedMaxThread, my_files.my_dictionary.Count); // one entry per unique host
    Console.WriteLine("MaxThreads = " + maxThreads);

    foreach (var pair in my_files.my_dictionary)
    {
        foreach (var path in pair.Value)
        {
            Console.WriteLine("Key= {0}, Value={1}", pair.Key, path);   
        }            
    }

    await my_files.my_dictionary.ForEachAsync(maxThreads, async (pair) =>
    {
        foreach (var path in pair.Value)
        {
            // serially process each path for a particular host.
            await process_file(path);
        }
    });

}

static void Main(string[] args)
{
    MainAsync().Wait();
    Console.ReadKey();

}//Close static void Main(string[] args)

Output

MaxThreads = 5
Key= host1, Value=C:\host1_file1
Key= host1, Value=C:\host1_file2
Key= host1, Value=C:\host1_file3
Key= host2, Value=C:\host2_file1
Key= host2, Value=C:\host2_file2
Key= host3, Value=C:\host3_file1
Key= host4, Value=C:\host4_file1
Key= host4, Value=C:\host4_file2
Key= host5, Value=C:\host5_file1
Key= host6, Value=C:\host6_file1
Started:C:\host1_file1 ThreadId:10
Started:C:\host2_file1 ThreadId:12
Started:C:\host3_file1 ThreadId:13
Started:C:\host4_file1 ThreadId:11
Started:C:\host5_file1 ThreadId:10
Completed:C:\host1_file1 ThreadId:13
Completed:C:\host2_file1 ThreadId:12
Started:C:\host1_file2 ThreadId:13
Started:C:\host2_file2 ThreadId:12
Completed:C:\host2_file2 ThreadId:11
Completed:C:\host1_file2 ThreadId:13
Started:C:\host6_file1 ThreadId:11
Started:C:\host1_file3 ThreadId:13
Completed:C:\host5_file1 ThreadId:11
Completed:C:\host4_file1 ThreadId:12
Completed:C:\host3_file1 ThreadId:13
Started:C:\host4_file2 ThreadId:12
Completed:C:\host1_file3 ThreadId:11
Completed:C:\host6_file1 ThreadId:13
Completed:C:\host4_file2 ThreadId:12

Answer 2

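I was playing around with your problem and came up with the following approach. It may not be the best, but I believe it meets your needs.

Before we begin: I'm a big fan of extension methods, so here is one: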

public static class IEnumerableExtensions
{
    public static void Each<T>(this IEnumerable<T> ie, Action<T, int> action)
    {
        var i = 0;
        foreach (var e in ie) action(e, i++);
    }
}

This loops over a collection (like foreach) but hands you both the item and its index. You will see later why this is needed.

Then we have the variables.

// Note: despite its name, this array holds the host names (index-aligned with the paths below).
public static string[] group_file_paths =
{
    "host1", "host1", "host1", "host2", "host2", "host3", "host4", "host4",
    "host5", "host6"
};

// Note: despite its name, this array holds the file paths.
public static string[] group_file_host_name =
{
    @"c:\host1_file1", @"c:\host1_file2", @"c:\host1_file3", @"c:\host2_file1", @"c:\host2_file2", @"c:\host3_file1",
    @"c:\host4_file1", @"c:\host4_file2", @"c:\host5_file1", @"c:\host6_file1"
};

Then the main code:

public static void Main(string[] args)
{
    Dictionary<string, List<string>> filesToProcess = new Dictionary<string, List<string>>();

    // Loop over the two arrays and build a dictionary with the host as the key and all of its file names as the value.
    group_file_paths.Each((host, hostIndex) =>
    {
        if (filesToProcess.ContainsKey(host))       
        { filesToProcess[host].Add(group_file_host_name[hostIndex]); }
        else
        {
            filesToProcess.Add(host, new List<string>());
            filesToProcess[host].Add(group_file_host_name[hostIndex]);
        }
    });

    var tasks = new List<Task>();

    foreach (var kvp in filesToProcess)
    {
        tasks.Add(Task.Factory.StartNew(() => 
        {
            foreach (var file in kvp.Value)
            {
                process_file(kvp.Key, file);
            }
        }));
    }

    var handleTaskCompletionTask = Task.WhenAll(tasks);
    handleTaskCompletionTask.Wait();
}

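Some explanation may be needed here:

I am creating a dictionary that has your hosts as keys, where each value is the list of files to process for that host.

Your dictionary will look like this:

Host 1
- File 1
- File 2
Host 2
- File 1
Host 3
- File 1
- File 2
- File 3

After that, I create a collection of tasks that will be executed using the TPL. I start all the tasks right away and then wait for all of them to complete.

Your processing method looks like this; it is for testing purposes only: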

    public static void process_file(string host, string file)
    {
        var time_delay_random = new Random();
        Console.WriteLine("Host '{0}' - Started processing the file {1}.", host, file);
        Thread.Sleep(time_delay_random.Next(3000) + 1000);
        Console.WriteLine("Host '{0}' - Completed processing the file {1}.", host, file);
        Console.WriteLine("");
    }

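This post does not provide a way to set the number of threads yourself, but that can easily be achieved by using a completion handler on the tasks. Whenever any task completes, you can loop over your collection again and start a new task for a host that has not been processed yet.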
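One way to add the missing limit is sketched below. Note that it uses a `SemaphoreSlim` rather than the completion-handler approach described above: every host gets a task, but a task must acquire a slot before it starts working through its files. `ThrottledRunner.RunAsync` and its parameter names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledRunner
{
    // Hypothetical helper: processes each host's files in order,
    // with at most maxConcurrentHosts hosts in flight at any time.
    public static async Task RunAsync(
        Dictionary<string, List<string>> filesByHost,
        int maxConcurrentHosts,
        Action<string, string> processFile)
    {
        if (maxConcurrentHosts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxConcurrentHosts));

        using (var slots = new SemaphoreSlim(maxConcurrentHosts))
        {
            var tasks = filesByHost.Select(async kvp =>
            {
                await slots.WaitAsync();                  // wait for a free slot
                try
                {
                    await Task.Run(() =>
                    {
                        foreach (var file in kvp.Value)   // sequential within one host
                            processFile(kvp.Key, file);
                    });
                }
                finally
                {
                    slots.Release();                      // free the slot for the next host
                }
            }).ToList();                                  // materialize so every task is started

            await Task.WhenAll(tasks);
        }
    }
}
```

Because the slot is held for the whole host rather than per file, two files from the same host can never overlap, and the number of hosts being processed never exceeds the limit.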

I hope this helps.

Answer 3

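I would start by organizing your data structure better. Having two separate arrays not only increases data duplication but also creates implicit coupling, which may not be obvious to someone reading your code.

A class that holds the information about a single task might look like this: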

public class TaskInfo
{
    private readonly string _hostName;
    public string HostName
    {
        get { return _hostName; }
    }

    private readonly ReadOnlyCollection<string> _files;
    public ReadOnlyCollection<string> Files
    {
        get { return _files; }
    }

    public TaskInfo(string host, IEnumerable<string> files)
    {
        _hostName = host;
        _files = new ReadOnlyCollection<string>(files.ToList());
    }
}

Now creating the list of tasks is much simpler:

var list = new List<TaskInfo>()
{
    new TaskInfo(
        host: "host1",
        files: new[] { @"c:\host1\file1.txt", @"c:\host1\file2.txt" }),

    new TaskInfo(
        host: "host2",
        files: new[] { @"c:\host2\file1.txt", @"c:\host2\file2.txt" })

    /* ... */
};

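Now that your tasks are ready, you can simply use the various classes in the System.Threading.Tasks namespace to invoke them in parallel. If you really want to limit the number of concurrent tasks, you can use the MaxDegreeOfParallelism property: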

Parallel.ForEach(
    list, 
    new ParallelOptions() { MaxDegreeOfParallelism = 10 },
    taskInfo => Process(taskInfo)
);
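`Process` is not defined in the answer. A minimal self-contained sketch of what the whole call might look like, assuming a simplified stand-in for `TaskInfo` (called `HostFiles` here) and a placeholder in place of the real per-file work. Each host is a single work item, so its files are handled one after another by the same worker.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Simplified stand-in for the TaskInfo class (hypothetical name).
public sealed class HostFiles
{
    public string HostName { get; }
    public IReadOnlyList<string> Files { get; }

    public HostFiles(string hostName, IReadOnlyList<string> files)
    {
        HostName = hostName;
        Files = files;
    }
}

public static class ParallelHostProcessor
{
    // Returns the processed files in completion order so that a caller can inspect them.
    public static List<string> Run(IEnumerable<HostFiles> work, int maxParallelHosts)
    {
        var log = new ConcurrentQueue<string>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallelHosts };

        Parallel.ForEach(work, options, item =>
        {
            foreach (var file in item.Files)      // one host's files, strictly in order
            {
                // Placeholder for the real per-file work.
                log.Enqueue(item.HostName + ":" + file);
            }
        });

        return new List<string>(log);
    }
}
```

`MaxDegreeOfParallelism` must be -1 (no limit) or a positive number, so a user-supplied value should be validated before it is passed in.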

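If you want to create your own thread pool, you can also achieve something similar using a ConcurrentQueue with multiple consumer threads, possibly waiting on a list of WaitHandles to know when they are done.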
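The hand-rolled variant could look roughly like this: a `ConcurrentQueue` holds one work item per host, and a fixed number of consumer threads drain it. This sketch detects completion with `Thread.Join` instead of a list of `WaitHandle`s, and the names `QueueWorkers.Run`, `workerCount` and `processFile` are assumptions.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class QueueWorkers
{
    public static void Run(
        IEnumerable<KeyValuePair<string, List<string>>> filesByHost,
        int workerCount,
        Action<string, string> processFile)
    {
        // One queue entry per host, so a host is never handled by two threads at once.
        var queue = new ConcurrentQueue<KeyValuePair<string, List<string>>>(filesByHost);
        var workers = new List<Thread>();

        for (int i = 0; i < workerCount; i++)
        {
            var worker = new Thread(() =>
            {
                KeyValuePair<string, List<string>> host;
                while (queue.TryDequeue(out host))        // stop when no hosts are left
                {
                    foreach (var file in host.Value)      // sequential within the host
                        processFile(host.Key, file);
                }
            });
            worker.Start();
            workers.Add(worker);
        }

        foreach (var worker in workers)
            worker.Join();                                // block until every consumer exits
    }
}
```

An unhandled exception on one of these threads would terminate the process, so real per-file work would normally be wrapped in its own error handling.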

Answer 4

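I think ThreadPool is the ideal solution for you. It manages the threads itself and queues work for them. In addition, you can set a maximum thread limit, and it will still queue your work even when you have more work items than the maximum number of threads.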

ThreadPool.SetMaxThreads([YourMaxThreads],[YourMaxThreads]);

foreach (var t in host_thread)
{
    ThreadPool.QueueUserWorkItem(Foo, t);
}

private static void Foo(object thread)
{
    foreach (var file in (thread as host_file_thread).group_file_paths)
    {
        (thread as host_file_thread).process_file(file);
    }
}

That said, I would suggest changing your data structure and keeping the process_file method.
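One caveat: `ThreadPool.SetMaxThreads` changes a process-wide setting, and it returns `false` without changing anything when the request is rejected, for example when the value is lower than the number of processors or lower than the pool's current minimum. A small sketch that checks the result instead of assuming the cap was applied (`PoolLimit.TrySet` is a hypothetical helper name):

```csharp
using System;
using System.Threading;

public static class PoolLimit
{
    // Tries to cap the shared thread pool and reports the limits actually in effect afterwards.
    public static bool TrySet(int requested, out int workerMax, out int ioMax)
    {
        bool applied = ThreadPool.SetMaxThreads(requested, requested);
        ThreadPool.GetMaxThreads(out workerMax, out ioMax);   // read back the effective values
        return applied;
    }
}
```

Because the pool is shared with timers, async continuations and library code, a limit scoped to this job alone, such as a semaphore or `MaxDegreeOfParallelism`, may be the safer choice.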

4 solutions

Solution 1
2 Accepted 2014-07-08 10:01:04

Solution 2
1 2014-07-08 07:00:55

Solution 3
1 2014-07-08 07:06:57

Solution 4
0 2014-07-08 06:15:28
