
How to read files in multithreaded mode?

I currently have a program that reads a very large file in single-threaded mode and creates a search index, but indexing takes too long in a single-threaded environment.

Now I am trying to make it work in multithreaded mode, but I am not sure of the best way to achieve that.

My main program creates a buffered reader and passes the instance to each thread, and the thread uses that buffered reader instance to read the file.

I don't think this works as expected; rather, each thread seems to be reading the same line again and again.

Is there a way to make the threads read only the lines that have not been read by another thread? Do I need to split the file? Is there a way to implement this without splitting the file?

Sample main program:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.util.ArrayList;

public class TestMTFile {
    public static void main(String args[]) {
        BufferedReader reader = null;
        ArrayList<Thread> threads = new ArrayList<Thread>();
        try {
            reader = new BufferedReader(new FileReader(
                    "test.tsv"));
        } catch (FileNotFoundException e1) {
            e1.printStackTrace();
        }
        for (int i = 0; i <= 10; i++) {
            Runnable task = new ReadFileMT(reader);
            Thread worker = new Thread(task);
            // We can set the name of the thread
            worker.setName(String.valueOf(i));
            // Start the thread, never call method run() direct
            worker.start();
            // Remember the thread for later usage
            threads.add(worker);
        }

        int running = 0;
        int runner1 = 0;
        int runner2 = 0;
        do {
            running = 0;
            for (Thread thread : threads) {
                if (thread.isAlive()) {
                    runner1 = running++;
                }
            }
            if (runner2 != runner1) {
                runner2 = runner1;
                System.out.println("We have " + runner2 + " running threads. ");

            }
        } while (running > 0);

        if (running == 0) {
            System.out.println("Ended");
        }
    }
}

Thread:

import java.io.BufferedReader;
import java.io.IOException;

public class ReadFileMT implements Runnable {
    BufferedReader bReader = null;

    ReadFileMT(BufferedReader reader) {
        this.bReader = reader;
    }

    public synchronized void run() {
        String line;
        try {
            while ((line = bReader.readLine()) != null) {

                try {
                    System.out.println(line);
                } catch (Exception e) {

                }
            }
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }
}

Your bottleneck is most likely the indexing, not the file reading. Assuming your indexing system supports multiple threads, you probably want a producer/consumer setup with one thread reading the file and pushing each line into a BlockingQueue (the producer), and multiple threads pulling lines from the BlockingQueue and pushing them into the index (the consumers).
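The producer/consumer setup described above could look roughly like the following. This is a minimal sketch, not the answerer's code: the class name `QueueIndexer`, the `indexLine` callback, the queue capacity of 1024, and the end-of-input sentinel are all assumptions made for illustration, and the sample input in `main` stands in for the real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

class QueueIndexer {
    // Hypothetical sentinel object telling a consumer that no more lines will arrive.
    private static final String POISON = new String("<<EOF>>");

    // indexLine is a placeholder for whatever pushes one line into the index.
    static void indexAll(BufferedReader reader, int consumers, Consumer<String> indexLine)
            throws IOException, InterruptedException {
        // Bounded queue (capacity is an arbitrary choice) so the reader cannot run far ahead.
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < consumers; i++) {
            Thread worker = new Thread(() -> {
                try {
                    // Compare by identity so a real line is never mistaken for the sentinel.
                    for (String line = queue.take(); line != POISON; line = queue.take()) {
                        indexLine.accept(line);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            worker.start();
            workers.add(worker);
        }
        try {
            // Producer: the calling thread is the only one that touches the reader.
            String line;
            while ((line = reader.readLine()) != null) {
                queue.put(line);
            }
        } finally {
            // One sentinel per consumer, sent even if reading failed, so every worker stops.
            for (int i = 0; i < consumers; i++) {
                queue.put(POISON);
            }
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicLong count = new AtomicLong();
        try (BufferedReader reader = new BufferedReader(new StringReader("a\tb\nc\td\n"))) {
            indexAll(reader, 4, line -> count.incrementAndGet());
        }
        System.out.println("Indexed lines: " + count.get()); // prints "Indexed lines: 2"
    }
}
```

Because only one thread reads, no line is handed out twice. Lines are not necessarily indexed in file order, and handling of a consumer that fails partway through is left out of this sketch.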

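See this thread: if your files are all on the same disk, you won't do any better than reading them with a single thread, although once you have read them into main memory you may be able to process the files with multiple threads.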
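One way to picture "read with one thread, process with many" is sketched below. It assumes the whole file fits in the heap, which may not hold for a very large file, and the names `ReadThenProcess`, `process`, and `perLine` are hypothetical. The per-line work here returns an `int` only so the sketch has a result to combine.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.ToIntFunction;

class ReadThenProcess {
    // Reads every line on the calling thread, then processes slices of the list in parallel.
    static int process(BufferedReader reader, int threads, ToIntFunction<String> perLine)
            throws IOException, InterruptedException, ExecutionException {
        List<String> lines = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            lines.add(line);
        }

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Roughly one slice per thread; at least one line per slice.
            int chunk = Math.max(1, (lines.size() + threads - 1) / threads);
            List<Future<Integer>> parts = new ArrayList<>();
            for (int start = 0; start < lines.size(); start += chunk) {
                List<String> slice = lines.subList(start, Math.min(start + chunk, lines.size()));
                parts.add(pool.submit(() -> {
                    int sum = 0;
                    for (String s : slice) {
                        sum += perLine.applyAsInt(s);
                    }
                    return sum;
                }));
            }
            // Combine the partial results once every slice has finished.
            int total = 0;
            for (Future<Integer> part : parts) {
                total += part.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        BufferedReader reader = new BufferedReader(new StringReader("a\tb\nc\td\te\n"));
        // Count tab-separated fields as a stand-in for real per-line work.
        int fields = process(reader, 2, s -> s.split("\t").length);
        System.out.println("Total fields: " + fields); // prints "Total fields: 5"
    }
}
```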

If you can use Java 8, you may be able to do this quickly and easily using the Streams API. Read the file into a MappedByteBuffer, which can open a file of up to 2GB very quickly, then read the lines out of the buffer (you need to make sure your JVM has enough extra memory to hold the file):

package com.objective.stream;

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class StreamsFileProcessor {
    private MappedByteBuffer buffer;

    public static void main(String[] args) {
        // Check the length: reading args[0] from an empty array would throw
        if (args.length > 0) {
            Path myFile = Paths.get(args[0]);
            StreamsFileProcessor proc = new StreamsFileProcessor();
            try {
                proc.process(myFile);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }

    public void process(Path file) throws IOException {
        readFileIntoBuffer(file);
        getBufferStream().parallel()
            .forEach(this::doIndex);
    }

    private Stream<String> getBufferStream() {
        // A mapped buffer has no accessible backing array, so copy its bytes onto the heap
        byte[] bytes = new byte[buffer.remaining()];
        buffer.get(bytes);
        // Do not close the reader here: the returned stream reads from it lazily
        BufferedReader reader = new BufferedReader(new InputStreamReader(new ByteArrayInputStream(bytes)));
        return reader.lines();
    }

    private void readFileIntoBuffer(Path file) throws IOException {
        try (FileInputStream fis = new FileInputStream(file.toFile())) {
            FileChannel channel = fis.getChannel();
            // READ_ONLY: a channel from FileInputStream is not writable, so PRIVATE would fail
            buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
        }
    }

    private void doIndex(String s) {
        // Do whatever I need to do to index the line here
    }
}

First, I agree with @Zim-Zam that it is the file IO, not the indexing, that is likely the rate-determining step. (So I disagree with @jtahlborn.) It depends on how complex the indexing is.

Second, in your code, each thread has its own, independent BufferedReader. Therefore they will all read the entire file. One possible fix is to use a single BufferedReader that they share. You would then need to synchronize calls to BufferedReader.readLine() (I think), since the javadocs are silent on whether BufferedReader is thread-safe. And, since I think the IO is the bottleneck, this will become the bottleneck, and I doubt multithreading will gain you much. But give it a try; I have been wrong occasionally. :-)

P.S. I agree with @jtahlborn that a producer/consumer pattern is better than my idea of sharing the BufferedReader, but that would be much more work for you.
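The shared-BufferedReader idea could be sketched as follows, with every thread locking on the same reader object around `readLine()`. The class name `SharedReaderWorkers` and the `handleLine` callback are hypothetical. Note that a `synchronized run()` method, as in the question's `ReadFileMT`, locks each Runnable instance separately, so it does not serialize access to a reader that several instances share.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

class SharedReaderWorkers {
    // handleLine is a placeholder for the per-line work (for example, indexing).
    static void readAll(BufferedReader shared, int threads, Consumer<String> handleLine)
            throws InterruptedException {
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            Thread worker = new Thread(() -> {
                while (true) {
                    String line;
                    // Hold the lock only while reading, so line handling can overlap across threads.
                    synchronized (shared) {
                        try {
                            line = shared.readLine();
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    }
                    if (line == null) {
                        return; // end of input: this worker is done
                    }
                    handleLine.accept(line);
                }
            });
            worker.start();
            workers.add(worker);
        }
        // join() replaces the busy-wait loop over isAlive() in the question.
        for (Thread worker : workers) {
            worker.join();
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger seen = new AtomicInteger();
        BufferedReader reader = new BufferedReader(new StringReader("one\ntwo\nthree\n"));
        readAll(reader, 4, line -> seen.incrementAndGet());
        System.out.println("Lines handled: " + seen.get()); // prints "Lines handled: 3"
    }
}
```

Each line is handed to exactly one thread, though not in a predictable order. If reading really is the bottleneck, the lock means this is unlikely to be faster than a single reader thread.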
