
Split large text file (5 million records) into smaller files in parallel using threads in C#

I have a large text file containing 5 million records (5 columns and 5 million rows). The image of the file is shown below.

(Image: the large text file to be split)

For splitting, I used the concept of threading. I created 10 threads to split the larger file, and I used a string array to store the values while reading the larger file. The code is shown below.

using System;
using System.IO;
using System.Threading;

class Program
{
    const string sourceFileName = @"C:\Users\Public\TestFolder\ThreadingExp\NewMarketData.txt";
    const string destinationFileName = @"C:\Users\Public\TestFolder\ThreadingExp\NewMarketData-Part-{0}.txt";

    static void Main(string[] args)
    {
        // Boundaries of the ten 500,000-line slices: 0, 500,000, ..., 5,000,000
        int[] index = new int[11];
        index[0] = 0;
        for (int i = 1; i < 11; i++)
        {
            index[i] = index[i - 1] + 500000;
        }

        // Reading part: load all 5 million lines into memory
        var sourceFile = new StreamReader(sourceFileName);
        string[] ListLines = new string[5000000];
        for (int i = 0; i < 5000000; i++)
        {
            ListLines[i] = sourceFile.ReadLine();
        }
        sourceFile.Close();

        // Creating array of threads
        Thread[] ArrayofThreads = new Thread[10];
        for (int i = 0; i < ArrayofThreads.Length; i++)
        {
            int part = i; // copy the loop variable so each thread captures its own slice
            ArrayofThreads[i] = new Thread(() => Writing(ListLines, index[part], index[part + 1]));
            ArrayofThreads[i].Start();
        }

        for (int i = 0; i < ArrayofThreads.Length; i++)
        {
            ArrayofThreads[i].Join();
        }
    }

    static void Writing(string[] array, int a, int b)
    {
        // Getting the thread number (used only to give each part file a unique name)
        int id = Thread.CurrentThread.ManagedThreadId;

        var destinationFile = new StreamWriter(string.Format(destinationFileName, id));

        for (int i = a; i < b; i++)
        {
            destinationFile.WriteLine(array[i]);
        }

        destinationFile.Close();
    }
}

The code works fine. Writing to the different files is done in parallel here. But for reading, I have stored the whole content in a single array and then passed it to the different threads, which write their portions using index ranges. I want to do both tasks (read the larger file and write the smaller files) in parallel using threads.

You're almost certainly better off doing this with a single thread.

First, you must read the text file sequentially. There's no shortcut that will let you skip ahead and find the 500,000th line without first reading the 499,999 lines that come before it.

Second, even if you could do that, the disk drive can only service a single request at a time. It can't be reading from two places at once. So while you're reading one part of the file, the thread that wants to read another part of the file is just sitting there waiting for the disk drive.

Finally, unless your output files are on separate drives, you have the same problem as with reading: the disk drive can only do one thing at a time.

So you're better off starting with something simple:

const int maxLinesPerFile = 500000;
int fileNumber = 0;
var destinationFile = File.CreateText("outputFile" + fileNumber);

int linesRead = 0;
foreach (var line in File.ReadLines(inputFile))
{
    destinationFile.WriteLine(line);
    if (++linesRead == maxLinesPerFile)
    {
        // Current part is full: close it and start the next one
        destinationFile.Close();
        linesRead = 0;
        ++fileNumber;
        destinationFile = File.CreateText("outputFile" + fileNumber);
    }
}
destinationFile.Close();

If your input and output files are on separate drives, you could potentially save a little bit of time by having two threads: one for input and one for output. They would communicate using a BlockingCollection. Basically, the input thread would put lines onto the queue and the output thread would read from the queue and write the files. In theory that overlaps the reading time with the writing time, but in practice the queue fills up and the reader ends up having to wait on the writing thread. You get some increase in performance, but not nearly what you'd expect.
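Below is a minimal sketch of that two-thread arrangement built on a BlockingCollection, assuming the same input file as the question. The output names, the 500,000-line part size, and the bounded queue capacity of 10,000 are illustrative assumptions rather than values taken from the answer above.

using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

class SplitWithQueue
{
    static void Main()
    {
        // Bounded queue so the reader cannot run arbitrarily far ahead of the writer.
        var queue = new BlockingCollection<string>(boundedCapacity: 10000);

        // Reader thread: read the input sequentially and enqueue each line.
        var reader = Task.Run(() =>
        {
            foreach (var line in File.ReadLines(@"C:\Users\Public\TestFolder\ThreadingExp\NewMarketData.txt"))
            {
                queue.Add(line);
            }
            queue.CompleteAdding();   // tell the writer that no more lines are coming
        });

        // Writer thread: dequeue lines and write them to the part files.
        var writer = Task.Run(() =>
        {
            const int maxLinesPerFile = 500000;   // assumed part size
            int fileNumber = 0;
            int linesWritten = 0;
            var destination = File.CreateText("outputFile" + fileNumber);
            foreach (var line in queue.GetConsumingEnumerable())
            {
                destination.WriteLine(line);
                if (++linesWritten == maxLinesPerFile)
                {
                    destination.Close();
                    linesWritten = 0;
                    ++fileNumber;
                    destination = File.CreateText("outputFile" + fileNumber);
                }
            }
            destination.Close();
        });

        Task.WaitAll(reader, writer);
    }
}

The bounded capacity keeps the reader from racing far ahead of the writer and holding millions of lines in memory; as noted above, in practice the queue fills up and the reader ends up waiting on the writing thread, so the gain over the single-threaded version is modest.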
