
Multithreading task to process files in C#

I've been reading a lot about threading but can't figure out how to solve my issue. First let me introduce the problem. I have files which need to be processed. The hostnames and file paths are stored in two arrays.

[Image: the hostname and file path arrays]
Now I want to set up several threads to process the files. The number of threads to create is based on three factors:
A) The maximum thread count cannot exceed the number of unique hostnames in any scenario.
B) Files with the same hostname MUST be processed sequentially, i.e. we cannot process host1_file1 and host1_file2 at the same time. (Data integrity would be put at risk, and this is beyond my control.)
C) The user may throttle the number of threads available for processing. The number of threads is still limited by condition A above. This is purely because, with a large number of hosts, say 50, we might not want 50 threads processing at the same time.

In the example above, a maximum of 6 threads can be created.
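Conditions A and C together amount to taking the smaller of the user's limit and the number of unique hostnames. A minimal sketch of that calculation, where the class name ThreadLimit, the method name MaxWorkers and the parameter userLimit are made up for illustration and are not part of the original code:

```csharp
using System;
using System.Linq;

public static class ThreadLimit
{
    // Condition A: never more workers than unique hosts.
    // Condition C: the user may lower that ceiling further.
    public static int MaxWorkers(string[] hostnames, int userLimit)
    {
        int uniqueHosts = hostnames.Distinct().Count();
        return Math.Min(userLimit, uniqueHosts);
    }

    public static void Main()
    {
        var hostname = new[] { "host1", "host1", "host1", "host2", "host2",
                               "host3", "host4", "host4", "host5", "host6" };
        Console.WriteLine(MaxWorkers(hostname, 50)); // 6 unique hosts, so prints 6
        Console.WriteLine(MaxWorkers(hostname, 4));  // user throttle wins, prints 4
    }
}
```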

The optimal processing routine is shown below.

[Image: optimal processing routine]


using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public class file_prep_obj
{
    public string[] file_paths;
    public string[] hostname;
    public Dictionary<string, int> my_dictionary;

    public void get_files()
    {
        hostname = new string[]{ "host1", "host1", "host1", "host2", "host2", "host3", "host4","host4","host5","host6" };
        file_paths=new string[]{"C:\\host1_file1","C:\\host1_file2","C:\\host1_file3","C:\\host2_file1","C:\\host2_file2",
                                "C:\\host3_file1","C:\\host4_file1","C:\\host4_file2","C:\\host5_file1","C:\\host6_file1"};
        //The dictionary provides a count on the number of files that need to be processed for a particular host.
        my_dictionary = hostname.GroupBy(x => x)
                        .ToDictionary(g => g.Key,
                        g => g.Count());
    }
}

//This class contains a list of file_paths associated with the same host.
//The group_file_host_name will be the same for a host.
class host_file_thread
{
    public string[] group_file_paths;
    public string[] group_file_host_name;

    public void process_file(string file_path_in)
    {
        var time_delay_random=new Random();
        Console.WriteLine("Started processing File: " + file_path_in);
        Task.Delay(time_delay_random.Next(3000)+1000);
        Console.WriteLine("Completed processing File: " + file_path_in);
    }
}

class Program
{
    static void Main(string[] args)
    {
        file_prep_obj my_files=new file_prep_obj();
        my_files.get_files();
        //Create our host objects... my_files.my_dictionary.Count represents the max number of threads
        host_file_thread[] host_thread=new host_file_thread[my_files.my_dictionary.Count];

        int key_pair_count=0;
        int file_path_position=0;
        foreach (KeyValuePair<string, int> pair in my_files.my_dictionary)
        {
            host_thread[key_pair_count] = new host_file_thread();   //Initialise the host_file_thread object. Because we have an array of a customised object
            host_thread[key_pair_count].group_file_paths=new string[pair.Value];        //Initialise the group_file_paths
            host_thread[key_pair_count].group_file_host_name=new string[pair.Value];    //Initialise the group_file_host_name


            for(int j=0;j<pair.Value;j++)
            {
                host_thread[key_pair_count].group_file_host_name[j]=pair.Key.ToString();                        //Group the hosts
                host_thread[key_pair_count].group_file_paths[j]=my_files.file_paths[file_path_position];        //Group the file_paths
                file_path_position++;
            }
            key_pair_count++;
        }//Close foreach (KeyValuePair<string, int> pair in my_files.my_dictionary)

        //TODO PROCESS FILES USING host_thread objects. 
    }//Close static void Main(string[] args)
}//Close Class Program



I guess what I'm after is a guide on how to code threaded processing routines that satisfy the specs above.

You can use Stephen Toub's ForEachAsync extension method to process the files. It allows you to specify how many concurrent threads you want to use, and it is non-blocking, so it frees up your main thread to do other processing. Here is the method from the article:

public static Task ForEachAsync<T>(this IEnumerable<T> source, int dop, Func<T, Task> body)
{
    return Task.WhenAll(
        from partition in Partitioner.Create(source).GetPartitions(dop)
        select Task.Run(async delegate
        {
            using (partition)
                while (partition.MoveNext())
                    await body(partition.Current);
        }));
}

In order to use it I refactored your code slightly. I changed the dictionary to be of type Dictionary<string, List<string>>, so it holds the host as the key and all of that host's paths as the value. I assumed each file path contains its host name.

   my_dictionary = (from h in hostname
                    from f in file_paths
                    where f.Contains(h)
                    select new { Hostname = h, File = f }).GroupBy(x => x.Hostname)
                    .ToDictionary(x => x.Key, x => x.Select(s => s.File).Distinct().ToList());
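The query above depends on each path containing its host name, and a substring test would also match "host1" inside a path such as C:\host10_file1. If the two arrays are truly parallel (same length, same order), pairing them by position avoids that assumption. A sketch of that alternative, which is not the code used in this answer; the names HostGrouping and GroupByHost are invented for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HostGrouping
{
    // Pairs hostname[i] with filePaths[i], then groups the paths by host.
    // Assumes both arrays have the same length and matching order.
    public static Dictionary<string, List<string>> GroupByHost(string[] hostname, string[] filePaths)
    {
        if (hostname.Length != filePaths.Length)
            throw new ArgumentException("hostname and filePaths must have the same length.");

        return hostname
            .Zip(filePaths, (host, path) => new { Host = host, Path = path })
            .GroupBy(x => x.Host)
            .ToDictionary(g => g.Key, g => g.Select(x => x.Path).ToList());
    }

    public static void Main()
    {
        var hosts = new[] { "host1", "host1", "host2" };
        var paths = new[] { @"C:\host1_file1", @"C:\host1_file2", @"C:\host2_file1" };

        foreach (var pair in GroupByHost(hosts, paths))
            Console.WriteLine("{0}: {1}", pair.Key, string.Join(", ", pair.Value));
    }
}
```

The length check matters here: with arrays of different lengths, Zip would silently drop the extra elements.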

I also changed your process_file method to be async, because you were using Task.Delay inside it, which you need to await; otherwise it doesn't do anything.

public static async Task process_file(string file_path_in)
{
    var time_delay_random = new Random();
    Console.WriteLine("Started:{0} ThreadId:{1}", file_path_in, Thread.CurrentThread.ManagedThreadId);
    await Task.Delay(time_delay_random.Next(3000) + 1000);
    Console.WriteLine("Completed:{0} ThreadId:{1}", file_path_in, Thread.CurrentThread.ManagedThreadId);
}
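To see why the await matters, the following sketch times a discarded Task.Delay against an awaited one. The class and method names (DelayDemo, NotAwaited, Awaited) are invented for the demonstration, and the exact timings will vary by machine:

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class DelayDemo
{
    // The delay task is created and then discarded, so the method returns at once.
    public static long NotAwaited()
    {
        var sw = Stopwatch.StartNew();
        Task.Delay(500);
        return sw.ElapsedMilliseconds;
    }

    // Awaiting the task suspends the method until the delay has elapsed.
    public static async Task<long> Awaited()
    {
        var sw = Stopwatch.StartNew();
        await Task.Delay(500);
        return sw.ElapsedMilliseconds;
    }

    public static void Main()
    {
        Console.WriteLine("Not awaited: {0} ms", NotAwaited());
        Console.WriteLine("Awaited:     {0} ms", Awaited().GetAwaiter().GetResult());
    }
}
```

The first figure should be close to zero and the second close to 500 ms.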

To use the code, work out the maximum number of threads you want to use and pass that to my_files.my_dictionary.ForEachAsync. You also supply an asynchronous delegate which processes the files for a particular host, awaiting each one in sequence.

public static async Task MainAsync()
{
    var my_files = new file_prep_obj();
    my_files.get_files();

    const int userSuppliedMaxThread = 5;
    var maxThreads = Math.Min(userSuppliedMaxThread, my_files.my_dictionary.Values.Count());
    Console.WriteLine("MaxThreads = " + maxThreads);

    foreach (var pair in my_files.my_dictionary)
    {
        foreach (var path in pair.Value)
        {
            Console.WriteLine("Key= {0}, Value={1}", pair.Key, path);   
        }            
    }

    await my_files.my_dictionary.ForEachAsync(maxThreads, async (pair) =>
    {
        foreach (var path in pair.Value)
        {
            // serially process each path for a particular host.
            await process_file(path);
        }
    });

}

static void Main(string[] args)
{
    MainAsync().Wait();
    Console.ReadKey();

}//Close static void Main(string[] args)

Output

MaxThreads = 5
Key= host1, Value=C:\host1_file1
Key= host1, Value=C:\host1_file2
Key= host1, Value=C:\host1_file3
Key= host2, Value=C:\host2_file1
Key= host2, Value=C:\host2_file2
Key= host3, Value=C:\host3_file1
Key= host4, Value=C:\host4_file1
Key= host4, Value=C:\host4_file2
Key= host5, Value=C:\host5_file1
Key= host6, Value=C:\host6_file1
Started:C:\host1_file1 ThreadId:10
Started:C:\host2_file1 ThreadId:12
Started:C:\host3_file1 ThreadId:13
Started:C:\host4_file1 ThreadId:11
Started:C:\host5_file1 ThreadId:10
Completed:C:\host1_file1 ThreadId:13
Completed:C:\host2_file1 ThreadId:12
Started:C:\host1_file2 ThreadId:13
Started:C:\host2_file2 ThreadId:12
Completed:C:\host2_file2 ThreadId:11
Completed:C:\host1_file2 ThreadId:13
Started:C:\host6_file1 ThreadId:11
Started:C:\host1_file3 ThreadId:13
Completed:C:\host5_file1 ThreadId:11
Completed:C:\host4_file1 ThreadId:12
Completed:C:\host3_file1 ThreadId:13
Started:C:\host4_file2 ThreadId:12
Completed:C:\host1_file3 ThreadId:11
Completed:C:\host6_file1 ThreadId:13
Completed:C:\host4_file2 ThreadId:12

I was playing around with your problem and came up with the following approach. It might not be the best, but I believe it suits your needs.

Before we begin: I'm a big fan of extension methods, so here is one:

public static class IEnumerableExtensions
{
    public static void Each<T>(this IEnumerable<T> ie, Action<T, int> action)
    {
        var i = 0;
        foreach (var e in ie) action(e, i++);
    }
}

This loops over a collection (like foreach) while passing both the item and its index to the action. You'll see why this is needed later.

Then we have the variables.

// Note: the two names are swapped relative to their contents. group_file_paths holds
// the host names and group_file_host_name holds the file paths; the main code relies on this.
public static string[] group_file_paths =
{
    "host1", "host1", "host1", "host2", "host2", "host3", "host4", "host4",
    "host5", "host6"
};

// One path per entry above, in the same order.
public static string[] group_file_host_name =
{
    @"c:\host1_file1", @"c:\host1_file2", @"c:\host1_file3", @"c:\host2_file1", @"c:\host2_file2", @"c:\host3_file1",
    @"c:\host4_file1", @"c:\host4_file2", @"c:\host5_file1", @"c:\host6_file1"
};

Then the main code:

public static void Main(string[] args)
{
    Dictionary<string, List<string>> filesToProcess = new Dictionary<string, List<string>>();

    // Loop over the 2 arrays and create a dictionary that has the host as the key and all of its filenames as the value.
    group_file_paths.Each((host, hostIndex) =>
    {
        if (filesToProcess.ContainsKey(host))       
        { filesToProcess[host].Add(group_file_host_name[hostIndex]); }
        else
        {
            filesToProcess.Add(host, new List<string>());
            filesToProcess[host].Add(group_file_host_name[hostIndex]);
        }
    });

    var tasks = new List<Task>();

    foreach (var kvp in filesToProcess)
    {
        tasks.Add(Task.Factory.StartNew(() => 
        {
            foreach (var file in kvp.Value)
            {
                process_file(kvp.Key, file);
            }
        }));
    }

    var handleTaskCompletionTask = Task.WhenAll(tasks);
    handleTaskCompletionTask.Wait();
}

Some explanation might be needed here:

So I'm creating a dictionary that contains your hosts as the keys and, as the values, lists of the files that need to be processed.

Your dictionary will look like:

  • Host 1
    • File 1
    • File 2
  • Host 2
    • File 1
  • Host 3
    • File 1
    • File 2
    • File 3

After that I create a collection of tasks that are executed using the TPL. All the tasks are started right away, and I wait for all of them to finish.

The process method looks as follows, just for testing purposes:

    public static void process_file(string host, string file)
    {
        var time_delay_random = new Random();
        Console.WriteLine("Host '{0}' - Started processing the file {1}.", host, file);
        Thread.Sleep(time_delay_random.Next(3000) + 1000);
        Console.WriteLine("Host '{0}' - Completed processing the file {1}.", host, file);
        Console.WriteLine("");
    }

This post does not include a way to set the number of threads yourself, but that can easily be achieved by using a completion handler on the tasks. Then, when any task completes, you can loop over your collection again and start a new task for a host that hasn't been processed yet.
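One way to read that suggestion is to keep a list of running tasks and wait for any of them to finish before starting the next host. A rough sketch under that reading, using Task.WaitAny rather than a continuation; the names ThrottledRunner, Run and maxThreads are hypothetical, and exceptions thrown by tasks that are removed early are not observed here:

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

public static class ThrottledRunner
{
    // Starts one task per host, never more than maxThreads at once.
    // The files of a single host are handled one after another inside its task.
    public static void Run(Dictionary<string, List<string>> filesToProcess,
                           int maxThreads, Action<string, string> processFile)
    {
        if (maxThreads < 1) throw new ArgumentOutOfRangeException("maxThreads");

        var running = new List<Task>();
        foreach (var kvp in filesToProcess)
        {
            if (running.Count >= maxThreads)
            {
                // Block until any running host task finishes, then free its slot.
                int finished = Task.WaitAny(running.ToArray());
                running.RemoveAt(finished);
            }

            var host = kvp.Key;
            var files = kvp.Value;
            running.Add(Task.Run(() =>
            {
                foreach (var file in files) processFile(host, file);
            }));
        }
        Task.WaitAll(running.ToArray());
    }

    public static void Main()
    {
        var files = new Dictionary<string, List<string>>
        {
            { "host1", new List<string> { @"c:\host1_file1", @"c:\host1_file2" } },
            { "host2", new List<string> { @"c:\host2_file1" } },
            { "host3", new List<string> { @"c:\host3_file1" } }
        };

        Run(files, 2, (host, file) =>
        {
            Console.WriteLine("Host '{0}' - processing {1}", host, file);
            Thread.Sleep(200); // stand-in for the real work
        });
    }
}
```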

So, I hope it helps.

I would start by organizing your data structure a bit better. Having two separate arrays not only increases data duplication, but also creates implicit coupling which may not be obvious to someone reading your code.

A class which holds the information for a single task might look something like:

public class TaskInfo
{
    private readonly string _hostName;
    public string HostName
    {
        get { return _hostName; }
    }

    private readonly ReadOnlyCollection<string> _files;
    public ReadOnlyCollection<string> Files
    {
        get { return _files; }
    }

    public TaskInfo(string host, IEnumerable<string> files)
    {
        _hostName = host;
        _files = new ReadOnlyCollection<string>(files.ToList());
    }
}

Creating a list of tasks is now much more straightforward:

var list = new List<TaskInfo>()
{
    new TaskInfo(
        host: "host1",
        files: new[] { @"c:\host1\file1.txt", @"c:\host1\file2.txt" }),

    new TaskInfo(
        host: "host2",
        files: new[] { @"c:\host2\file1.txt", @"c:\host2\file2.txt" })

    /* ... */
};

And now that you have your tasks ready, you can use the classes in the System.Threading.Tasks namespace to invoke them in parallel. If you really want to limit the number of concurrent tasks, you can use the MaxDegreeOfParallelism property:

Parallel.ForEach(
    list, 
    new ParallelOptions() { MaxDegreeOfParallelism = 10 },
    taskInfo => Process(taskInfo)
);
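The Process method is not shown in this answer. A minimal sketch of what it might look like: it walks one host's files in order, so files of the same host never overlap, while Parallel.ForEach spreads different hosts across threads. The TaskInfo here is a simplified stand-in for the class above, and HostProcessor, ProcessAll, userLimit and the log queue are assumptions added for illustration:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public class TaskInfo
{
    public string HostName { get; private set; }
    public IReadOnlyList<string> Files { get; private set; }

    public TaskInfo(string host, IEnumerable<string> files)
    {
        HostName = host;
        Files = files.ToList();
    }
}

public static class HostProcessor
{
    // Handles every file of one host sequentially and records the order in the log.
    public static void Process(TaskInfo taskInfo, ConcurrentQueue<string> log)
    {
        foreach (var file in taskInfo.Files)
        {
            Thread.Sleep(50); // stand-in for the real work
            log.Enqueue(taskInfo.HostName + ":" + file);
        }
    }

    // Runs the hosts in parallel, capped by the user's limit and the host count.
    public static ConcurrentQueue<string> ProcessAll(IList<TaskInfo> list, int userLimit)
    {
        var log = new ConcurrentQueue<string>();
        var options = new ParallelOptions
        {
            MaxDegreeOfParallelism = Math.Max(1, Math.Min(userLimit, list.Count))
        };
        Parallel.ForEach(list, options, taskInfo => Process(taskInfo, log));
        return log;
    }

    public static void Main()
    {
        var list = new List<TaskInfo>
        {
            new TaskInfo("host1", new[] { @"c:\host1\file1.txt", @"c:\host1\file2.txt" }),
            new TaskInfo("host2", new[] { @"c:\host2\file1.txt", @"c:\host2\file2.txt" })
        };

        foreach (var entry in ProcessAll(list, 10))
            Console.WriteLine(entry);
    }
}
```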

If you wanted to create your own pool of threads, you could also achieve a similar thing using a ConcurrentQueue with multiple consumer threads, possibly waiting on a list of WaitHandles to know when they're done.
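A sketch of that hand-rolled variant, under a few assumptions: each queue item is one host together with all of its files, every consumer thread signals a ManualResetEvent when the queue is empty, and the caller waits on each handle in turn. The names QueueWorkers, Run and workerCount are invented, and an exception thrown by processFile on a worker thread is not handled here:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;

public static class QueueWorkers
{
    // A host is only ever handled by the single thread that dequeued it,
    // so its files are processed sequentially.
    public static void Run(IEnumerable<KeyValuePair<string, List<string>>> hosts,
                           int workerCount, Action<string, string> processFile)
    {
        // The queue is fully populated before any worker starts, so an empty
        // queue means there is no more work.
        var queue = new ConcurrentQueue<KeyValuePair<string, List<string>>>(hosts);
        var done = new ManualResetEvent[workerCount];

        for (int i = 0; i < workerCount; i++)
        {
            var finished = done[i] = new ManualResetEvent(false);
            new Thread(() =>
            {
                try
                {
                    KeyValuePair<string, List<string>> item;
                    while (queue.TryDequeue(out item))
                        foreach (var file in item.Value)
                            processFile(item.Key, file);
                }
                finally
                {
                    finished.Set(); // tell the caller this worker is done
                }
            }).Start();
        }

        foreach (var handle in done)
        {
            handle.WaitOne();
            handle.Dispose();
        }
    }

    public static void Main()
    {
        var hosts = new Dictionary<string, List<string>>
        {
            { "host1", new List<string> { @"c:\host1\file1.txt", @"c:\host1\file2.txt" } },
            { "host2", new List<string> { @"c:\host2\file1.txt" } },
            { "host3", new List<string> { @"c:\host3\file1.txt" } }
        };

        Run(hosts, 2, (host, file) =>
        {
            Console.WriteLine("Host '{0}' - processing {1}", host, file);
            Thread.Sleep(200); // stand-in for the real work
        });
    }
}
```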

I think ThreadPool is the perfect solution for you. It manages the threads itself and queues their work. Moreover, you can set a maximum thread limit, and it will still queue your work even if you submit more work items than the maximum number of threads.

ThreadPool.SetMaxThreads([YourMaxThreads],[YourMaxThreads]);

foreach (var t in host_thread)
{
    ThreadPool.QueueUserWorkItem(Foo, t);
}


private static void Foo(object thread)
{
    foreach (var file in (thread as host_file_thread).group_file_paths)
    {
        (thread as host_file_thread).process_file(file);
    }
}

Although I would suggest you change your data structure and keep the process_file method out of it.

