
How can I get the tesseract regular English language package for Alpine Linux?

I am building a Docker image based on Alpine that depends on tesseract for OCR. The tesseract site lists two flavors of English: eng (modern English) and enm (Middle English). However, I am having trouble getting the eng version installed on Alpine.

My Dockerfile has the following:

FROM eclipse-temurin:17-jre-alpine as tesseract-master

RUN apk update && apk add tesseract-ocr
RUN apk update && apk add tesseract-ocr-data-eng

This fails because the eng language package cannot be found. During the build, the repo is listed, and it clearly does not contain the eng package.

I am able to install the enm package, but I expect there will be issues since it is meant for Middle English.

Has anyone had success installing the eng package on Alpine?

If you look at the contents of one of those language packages, for example tesseract-ocr-data-enm, you will quickly realise that it contains only one file:

  • /usr/share/tessdata/enm.traineddata

Source: https://pkgs.alpinelinux.org/contents?name=tesseract-ocr-data-enm&branch=v3.17&arch=aarch64

Now, working backwards, you can search for the package that contains the file /usr/share/tessdata/eng.traineddata, and it is, with no big surprise, the default package: tesseract-ocr.

Source: https://pkgs.alpinelinux.org/contents?file=eng.traineddata&branch=v3.17&arch=aarch64
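
The same lookup can probably be done locally rather than through the package website. This is a minimal sketch, assuming you are in a shell inside an Alpine container where tesseract-ocr is already installed. Package layout may differ between Alpine releases, so it is worth checking against the release your base image uses:

```shell
# Run inside an Alpine container with tesseract-ocr installed.

# List the files shipped by the installed package, keeping only trained data.
apk info -L tesseract-ocr | grep traineddata

# Ask apk which installed package owns the English data file.
apk info --who-owns /usr/share/tessdata/eng.traineddata
```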

So, your Dockerfile should simply be:

FROM eclipse-temurin:17-jre-alpine as tesseract-master

RUN apk add --no-cache \
      tesseract-ocr
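
If you want the build to fail early when the English data is missing, a small check along these lines might help. It is only a sketch: check_lang is a hypothetical helper name, and the default directory assumes the /usr/share/tessdata path shown by the package contents above.

```shell
# check_lang LANG [DIR]
# Succeeds if DIR/LANG.traineddata exists as a regular file.
# DIR defaults to /usr/share/tessdata (assumed from the package contents).
check_lang() {
    lang="$1"
    dir="${2:-/usr/share/tessdata}"
    [ -f "$dir/$lang.traineddata" ]
}

# Example use inside the image (not executed here):
#   check_lang eng || { echo "eng.traineddata is missing" >&2; exit 1; }
```

Running tesseract --list-langs in the finished image is another way to see which languages tesseract itself picks up.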
