
Tesseract multithreading in C#

I have code that runs Tesseract in a single instance. How can I parallelize it so that it uses all the cores on a quad-core or 8-core system? Here is my code block. Thanks in advance.

using (TesseractEngine engine = new TesseractEngine(@"./tessdata", "tel+tel1", EngineMode.Default))
{
    foreach (string ab in files)
    {
        using (var pages = Pix.LoadFromFile(ab))
        {
            using (Tesseract.Page page = engine.Process(pages, Tesseract.PageSegMode.SingleBlock))
            {
                string text = page.GetText();
                OCRedText.Append(text);
            }
        }
    }
}

The simplest way to run this code in parallel is PLINQ. Calling AsParallel() on an enumeration makes the query that follows it (.Select(...)) run concurrently across the available CPU cores.

It is crucial to run only thread-safe code in parallel. Assuming TesseractEngine is thread-safe (as you suggest in a comment; I didn't verify it myself), and that Pix.LoadFromFile() is as well, the only problematic part is OCRedText.Append(). The code doesn't show what OCRedText is, so I assume it is a StringBuilder or a List and therefore not thread-safe. I moved that call out of the parallel section and run it afterwards on a single thread. Since .Append() is likely to be fast, this shouldn't noticeably hurt overall performance.

using (TesseractEngine engine = new TesseractEngine(@"./tessdata", "tel+tel1", EngineMode.Default))
{
    var texts = files.AsParallel().Select(ab =>
            {
                using (var pages = Pix.LoadFromFile(ab))
                {
                    using (Tesseract.Page page = engine.Process(pages, Tesseract.PageSegMode.SingleBlock))
                    {
                        return page.GetText();
                    }
                }
            });

    foreach (string text in texts)
    {
        OCRedText.Append(text);
    }
}
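
One thing to keep in mind with the approach above: PLINQ does not guarantee that results come back in input order unless AsOrdered() is used, so the appended text may not follow the order of files. The sketch below shows the same pattern with ordering preserved and an explicit cap on worker threads. To keep it self-contained it does not reference Tesseract; the recognize delegate, the OrderedOcr class and the maxThreads parameter are hypothetical names standing in for the per-file OCR step (load the image, call engine.Process, return page.GetText()).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

static class OrderedOcr
{
    // 'recognize' is a hypothetical stand-in for the per-file OCR call.
    // It must be safe to call from several threads at once.
    public static string RunOrdered(
        IEnumerable<string> files,
        Func<string, string> recognize,
        int maxThreads)
    {
        var texts = files
            .AsParallel()
            .AsOrdered()                          // yield results in input order
            .WithDegreeOfParallelism(maxThreads)  // cap the number of concurrent workers
            .Select(recognize);

        // Merge on a single thread, since StringBuilder is not thread-safe.
        var result = new StringBuilder();
        foreach (string text in texts)
        {
            result.Append(text);
        }
        return result.ToString();
    }
}
```

A call such as OrderedOcr.RunOrdered(files, OcrOneFile, Environment.ProcessorCount) would use one worker per logical core, where OcrOneFile is whatever method wraps the Tesseract calls for a single file.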

This has worked for me:

static IEnumerable<string> Ocr(string directory, string sep)
    => Directory.GetFiles(directory, sep)
        .AsParallel()
        .Select(x =>
        {
            using var engine = new TesseractEngine(@"D:\source\repos\WPF\OcrTest\tessdata\", "deu", EngineMode.Default);
            using var img = Pix.LoadFromFile(x);
            using var page = engine.Process(img);
            return page.GetText();
        }).ToList();

I am no expert on parallelization, but this function OCRs 8 TIFF files in 12 seconds.
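
The function above builds a new engine for every file. If engine construction turns out to be a noticeable part of the run time (an assumption, not something measured here), one possible variation is to create one engine per worker thread with ThreadLocal<T> and reuse it. The sketch below is generic so that it compiles without Tesseract; PerThreadOcr, createEngine and recognize are hypothetical names, and the comments show how they might map onto the calls used in the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

static class PerThreadOcr
{
    // Assumed wiring for Tesseract (tessdataPath is a placeholder):
    //   createEngine: () => new TesseractEngine(tessdataPath, "deu", EngineMode.Default)
    //   recognize:    (engine, file) =>
    //                 {
    //                     using var img = Pix.LoadFromFile(file);
    //                     using var page = engine.Process(img);
    //                     return page.GetText();
    //                 }
    public static List<string> Run<TEngine>(
        IEnumerable<string> files,
        Func<TEngine> createEngine,
        Func<TEngine, string, string> recognize)
        where TEngine : IDisposable
    {
        // trackAllValues: true makes engines.Values available for cleanup.
        using var engines = new ThreadLocal<TEngine>(createEngine, trackAllValues: true);
        try
        {
            return files
                .AsParallel()
                .AsOrdered()  // keep results in input order
                .Select(file => recognize(engines.Value, file))  // one engine per thread
                .ToList();
        }
        finally
        {
            // Dispose every engine that a worker thread created.
            foreach (TEngine engine in engines.Values)
            {
                engine.Dispose();
            }
        }
    }
}
```

Each engine is only ever touched by the thread that created it, so this variation does not depend on the engine being thread-safe.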

