What is the best way to work with multiple files in multithread in C#?

I am creating a Windows Forms application in which I select a folder that contains multiple *.txt files. Their length may vary from a few thousand lines (kB) up to 50 million lines (1 GB). Every line of a file holds three pieces of information: the date as a long, the location id as an int, and the value as a float, all separated by semicolons (;). I need to find the minimum and maximum value across all those files and report which file each is in, and then find the most frequent value.

I already have these files verified and stored in an ArrayList. I open a thread that reads the files one by one, line by line. It works fine, but with 1 GB files I run out of memory. I tried to store the values in a dictionary, where the key is the date and the value is an object that holds all the information loaded from the line along with the file name. I see that I cannot use a dictionary, because at about 6M values I run out of memory. So I should probably do it with multiple threads. I thought I could run two threads: one that reads the file and puts the data into some kind of container, and another that reads from the container, makes the calculations, and then deletes the values from it. But I don't know which container could do such a thing. Moreover, I need to calculate the most frequent value, so the values need to be stored somewhere, which leads me back to some kind of dictionary, and I already know I will run out of memory. I don't have much experience with threads either, so I don't know what is possible. Here is my code so far:

GUI:

namespace STI {
    public partial class GUI : Form {
        private String path = null;
        public static ArrayList txtFiles;

        public GUI() {
            InitializeComponent();
            _GUI1 = this;
        }

        // I run it in a thread. I thought I would start the second one here,
        // which would work with the values put into some container.
        private void buttonRun_Click(object sender, EventArgs e) {
            ThreadDataProcessing processing = new ThreadDataProcessing();
            Thread t_process = new Thread(processing.runProcessing);
            t_process.Start();

            //ThreadDataCalculating calculating = new ThreadDataCalculating();
            //Thread t_calc = new Thread(calculating.runCalculation());
            //t_calc.Start();

        }


    }
}

ThreadProcessing.cs

namespace STI.thread_package {
    class ThreadDataProcessing {
        public static Dictionary<long, object> finalMap = new Dictionary<long, object>();

        public void runProcessing() {
            foreach (FileInfo file in GUI.txtFiles) {
                using (FileStream fs = File.Open(file.FullName.ToString(), FileMode.Open))
                using (BufferedStream bs = new BufferedStream(fs))
                using (StreamReader sr = new StreamReader(bs)) {
                    String line;
                    String[] splitted;
                    try { 
                        while ((line = sr.ReadLine()) != null) {
                            splitted = line.Split(';');

                            if (splitted.Length == 3) {
                                long date = long.Parse(splitted[0]);
                                int location = int.Parse(splitted[1]);
                                float value = float.Parse(splitted[2], CultureInfo.InvariantCulture);

                                Entry entry = new Entry(date, location, value, file.Name);

                                if (!finalMap.ContainsKey(entry.getDate())) {
                                    finalMap.Add(entry.getDate(), entry);

                                }
                            }
                        }
                        GUI._GUI1.update("File \"" + file.Name + "\" completed\n");
                    }
                    catch (FormatException ex) {
                        GUI._GUI1.update("Wrong file format.");
                    }
                    catch (OutOfMemoryException) {
                        GUI._GUI1.update("Out of memory");
                    }
                }

            }
        }
    }
}

and the object in which I put the values from the lines, Entry.cs:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace STI.entities_package {
    class Entry {
        private long date;
        private int location;
        private float value;
        private String fileName;
        private int count;

        public Entry(long date, int location, float value, String fileName) {
            this.date = date;
            this.location = location;
            this.value = value;
            this.fileName = fileName;

            this.count = 1;
        }

        public long getDate() {
            return date;
        }

        public int getLocation() {
            return location;
        }

        public String getFileName() {
            return fileName;
        }

    }
}

I don't think that multithreading is going to help you here. It could help you separate the IO-bound tasks from the CPU-bound tasks, but your CPU-bound tasks are so trivial that I don't think they warrant their own thread. All multithreading is going to do is unnecessarily increase the problem complexity.

Calculating the min/max in constant memory is trivial: keep a running minimum and maximum, plus a minFile and maxFile variable recording the file each came from, and update them whenever the current value is less than the minimum or greater than the maximum.

Finding the most frequent value is going to require more memory, but with only a few million values you ought to have enough RAM to store a Dictionary<float, int> that maintains the frequency of each value, after which you iterate through the map to determine which value had the highest frequency.

If for some reason you don't have enough RAM (if you're running out of memory, first make sure that your files are being closed and garbage collected, because a Dictionary<float, int> with a few million entries ought to fit in less than a gigabyte of RAM), you can make multiple passes over the files: on the first pass, store the counts in a Dictionary<interval, int> where you've split the range between MIN_FLOAT and MAX_FLOAT (float.MinValue and float.MaxValue in C#) into a few thousand sub-intervals; on the next pass you can ignore all values that didn't fall into the interval with the highest frequency, thus shrinking the dictionary's size. However, the Dictionary<float, int> ought to fit into memory, so unless you start processing billions of values instead of millions you probably won't need a multi-pass procedure.
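As a rough illustration of the single-pass idea, here is a minimal sketch. The class ValueStats and its members are hypothetical names, not part of the question's code. It assumes lines of the form date;location;value and, unlike the question's code, silently skips lines that do not parse. Only the frequency dictionary grows with the data; the lines themselves are never stored.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Hypothetical helper for illustration; not part of the question's code.
public sealed class ValueStats
{
    public float Min = float.PositiveInfinity;
    public float Max = float.NegativeInfinity;
    public string MinFile;   // name of the file that contained Min
    public string MaxFile;   // name of the file that contained Max
    public readonly Dictionary<float, int> Frequency = new Dictionary<float, int>();

    // Streams each file lazily; only the current line is held in memory.
    public void AddFiles(IEnumerable<string> paths)
    {
        foreach (string path in paths)
        {
            string name = Path.GetFileName(path);
            foreach (string line in File.ReadLines(path))
                AddLine(line, name);
        }
    }

    // Assumes "date;location;value"; malformed lines are skipped (an assumption).
    public void AddLine(string line, string fileName)
    {
        string[] parts = line.Split(';');
        if (parts.Length != 3) return;
        if (!float.TryParse(parts[2], NumberStyles.Float,
                            CultureInfo.InvariantCulture, out float value))
            return;

        if (value < Min) { Min = value; MinFile = fileName; }
        if (value > Max) { Max = value; MaxFile = fileName; }

        // A missing key leaves count at 0, so the first occurrence stores 1.
        Frequency.TryGetValue(value, out int count);
        Frequency[value] = count + 1;
    }

    // Returns the value with the highest count, or NaN if nothing was added.
    public float MostFrequent()
    {
        float best = float.NaN;
        int bestCount = 0;
        foreach (KeyValuePair<float, int> pair in Frequency)
        {
            if (pair.Value > bestCount) { best = pair.Key; bestCount = pair.Value; }
        }
        return best;
    }
}
```

With the question's ArrayList of FileInfo objects, one could collect each file.FullName into a list, call AddFiles on it from the existing worker thread, and then read Min, MinFile, Max, MaxFile and MostFrequent().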
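If the multi-pass route is ever needed, it might look something like the sketch below. Two things here are my assumptions rather than part of the answer. First, the intervals are spread over the observed minimum and maximum instead of the whole float range, since realistic data would otherwise land almost entirely in one interval. Second, the fullest interval does not necessarily contain the most frequent value, so the sketch treats each interval's total as an upper bound and keeps checking intervals, fullest first, until no remaining interval could beat the best count found so far. MultiPassMode and the readValues callback (which re-reads every value from disk on each call) are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the question's code.
public static class MultiPassMode
{
    static bool IsFinite(float v)
    {
        return !float.IsNaN(v) && !float.IsInfinity(v);
    }

    // Maps a value to one of bucketCount equal-width intervals over [min, max].
    static int BucketOf(float v, float min, float max, int bucketCount)
    {
        double range = (double)max - min;
        if (range <= 0) return 0;
        int index = (int)(((double)v - min) / range * bucketCount);
        return Math.Min(index, bucketCount - 1); // v == max goes to the last bucket
    }

    public static float MostFrequent(Func<IEnumerable<float>> readValues, int bucketCount)
    {
        // Pass 1: observed range. Non-finite values are ignored throughout.
        float min = float.MaxValue, max = float.MinValue;
        foreach (float v in readValues())
        {
            if (!IsFinite(v)) continue;
            if (v < min) min = v;
            if (v > max) max = v;
        }
        if (min > max) throw new InvalidOperationException("No finite values.");

        // Pass 2: how many values fall into each interval.
        long[] totals = new long[bucketCount];
        foreach (float v in readValues())
        {
            if (IsFinite(v)) totals[BucketOf(v, min, max, bucketCount)]++;
        }

        // Further passes: exact counts for one interval at a time, fullest first.
        float best = min;
        long bestCount = 0;
        foreach (int b in Enumerable.Range(0, bucketCount)
                                    .OrderByDescending(i => totals[i]))
        {
            // An interval's total bounds the count of any single value inside it.
            if (totals[b] <= bestCount) break;

            var counts = new Dictionary<float, int>();
            foreach (float v in readValues())
            {
                if (!IsFinite(v) || BucketOf(v, min, max, bucketCount) != b) continue;
                counts.TryGetValue(v, out int c);
                counts[v] = c + 1;
            }
            foreach (KeyValuePair<float, int> pair in counts)
            {
                if (pair.Value > bestCount) { best = pair.Key; bestCount = pair.Value; }
            }
        }
        return best;
    }
}
```

The dictionary only ever holds the distinct values of one interval, at the cost of re-reading the files once per interval examined. In the worst case that is one pass per interval, so this trade only makes sense when the full Dictionary<float, int> really does not fit in memory.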
