
Empty Parser and Tika Server mode

I am having trouble understanding how parsers are loaded into Tika. From the documentation, it appears that tika-app comes prepackaged with the parsers ( https://tika.apache.org/1.17/gettingstarted.html ). However, when I run this command to start the server, I get the following output:

    ./.java-buildpack/open_jdk_jre/bin/java -jar ./lib/tika-app-1.24.1.jar -s --port ${PORT}

    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR Nov 02, 2020 7:30:26 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
    2020-11-02T13:30:26.04-0600 [APP/PROC/WEB/0] ERR for optional dependencies.
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR Nov 02, 2020 7:30:26 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR WARNING: org.xerial's sqlite-jdbc is not loaded.
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR Please provide the jar on your classpath to parse sqlite files.
    2020-11-02T13:30:26.53-0600 [APP/PROC/WEB/0] ERR See tika-parsers/pom.xml for the correct version.
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] OUT Successfully started tika-app's server on port: 8080
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR WARNING: The server option in tika-app is deprecated and will be removed
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR by Tika 2.0 if not shortly after Tika 1.14.
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR Please migrate to the JAX-RS tika-server package.
    2020-11-02T13:30:26.80-0600 [APP/PROC/WEB/0] ERR See https://wiki.apache.org/tika/TikaJAXRS for usage.
    2020-11-02T13:31:25.66-0600 [HEALTH/0] ERR Failed to make HTTP request to '/version' on port 8080: timed out after 1.00 seconds
    2020-11-02T13:31:25.66-0600 [CELL/0] ERR Timed out after 1m0s: health check never passed.

I have the most recent Tika version, 1.24.1. The documentation mentions downloading tika-server and passing a classpath at runtime that points to a tika-parsers.jar ( https://cwiki.apache.org/confluence/display/TIKA/Troubleshooting+Tika#TroubleshootingTika-ParsersMissing ), but I can't find the parsers.jar file anywhere. I am using openjdk-jre-1.8.0 to run this.

The parsers should be bundled by default. Tika App in server mode (-s) is a socket-based server, not an HTTP server, which would explain why the HTTP health check on /version in your log times out. You can confirm that it is working by sending a file with netcat and checking whether you get a response:

    nc localhost 8080 -q2 < test.pdf

To use this in Python, you would need to write custom code that opens a socket, sends the input, sends a SHUT_WR, and reads the output back.
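The socket steps described above could look roughly like the sketch below. It uses only the Python standard library. The function name `extract_via_socket`, the default host and port, and the timeout are assumptions chosen for illustration; the port matches the 8080 seen in the log above.

```python
import socket


def extract_via_socket(data: bytes, host: str = "localhost",
                       port: int = 8080, timeout: float = 30.0) -> bytes:
    """Send raw document bytes to a socket server and return its reply.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # Send the whole document.
        sock.sendall(data)
        # Half-close the connection (SHUT_WR) so the server sees end of input.
        sock.shutdown(socket.SHUT_WR)
        # Read until the server closes its side of the connection.
        chunks = []
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)
```

To try it, read `test.pdf` in binary mode and pass the bytes to `extract_via_socket`. The reply is returned as raw bytes, so decoding is left to the caller.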

If you are using the tika-python library, it expects a Tika Server, which is in the tika-server JAR, not the tika-app JAR. The library has some helper settings so that you can point it at the JAR, or you can host your own instance (self-run or Docker) and give it the URL.
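For comparison, once a Tika Server instance (tika-server JAR, not tika-app) is running, it speaks HTTP. The sketch below does not use tika-python; it is a standard-library illustration that sends a document to the server's `/tika` endpoint with a PUT request. The function name `extract_text_over_http` and the base URL `http://localhost:9998` are assumptions; adjust the URL to wherever your instance is hosted.

```python
import urllib.request


def extract_text_over_http(data: bytes,
                           base_url: str = "http://localhost:9998",
                           timeout: float = 30.0) -> str:
    """PUT document bytes to a Tika Server /tika endpoint, return plain text.

    Hypothetical helper: the name and default URL are illustrative only.
    """
    request = urllib.request.Request(
        base_url.rstrip("/") + "/tika",
        data=data,
        method="PUT",
        headers={"Accept": "text/plain"},  # ask for plain-text output
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Unlike the socket mode of tika-app, a server of this kind can answer HTTP requests, so an HTTP health check has something to talk to.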
