
Increase tika heap size in Python with tika-python

Can someone suggest a way to give tika a larger heap size (1 GByte or so) while using tika-python (on Windows)?

I get "status: 500" errors from tika when processing very large Microsoft Word files. If I run tika from the Windows command line as follows, the errors go away:

C:>java -Xmx1G -jar tika-app-2.1.0.jar

The -Xmx1G specifies a maximum heap size of 1 GByte (much larger than the default).

I've seen several answers for other languages, but none specific to Python with tika-python.

I've tried:

import os
os.environ["TIKA_JAVA_ARGS"] = "-Xmx1G"  # set before tika is imported
from tika import parser as tika_parser

and:

def main():  
    global MODEL_LIST   
    os.environ["TIKA_JAVA_ARGS"] = "-Xmx1G"
    start_time = time.time()
    ... rest of code ...

and from the Windows command line:

C:\<path>\findEm>set TIKA_JAVA_ARGS="-Xmx1G"
C:\<path>\findEm>python3 findEmv1.52.py

All three methods result in the same error, something like:

2021-10-19 14:43:55,782 [MainThread  ] [WARNI]  Tika server returned status: 500

I think the main problem is that the Java tika process is already running when I'm trying to change the maximum heap size - somehow I need to kill that, set the heap size max, and restart the Java tika server. How?

Your suspicion about the process already running is correct. If a tika server is still running in the background when your script starts, tika-python reuses it instead of restarting the Java process with the new flag, so the heap size never increases.

To solve that, we can do everything from Python on Windows with the help of psutil:

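from typing import Optional

import psutil
from tika import tika as tika_server
from tika import parser


def get_tika_process() -> Optional[psutil.Process]:
    """Return the first Java process whose command line mentions tika, or None."""
    for process in psutil.process_iter(["name", "cmdline"]):
        # process.info holds the prefetched attributes; psutil sets a value
        # to None when it was not allowed to read it.
        name = process.info["name"] or ""
        cmdline = process.info["cmdline"] or []
        if "java" in name and any("tika" in part for part in cmdline):
            return process
    return None


# Stop a server left over from an earlier run so the new JVM flag takes effect.
if existing_tika_process := get_tika_process():
    print("Found tika process:", existing_tika_process)
    print("Existing process args:", existing_tika_process.cmdline())
    existing_tika_process.terminate()
    terminate_result = existing_tika_process.wait(10)  # wait up to 10 seconds
    print(f"Terminated tika; exit code {terminate_result}")
else:
    print("No existing tika process found")

# Append the heap flag, separated by a space from anything already set.  See note {1}
tika_server.TikaJavaArgs = f"{tika_server.TikaJavaArgs} -Xmx1G".strip()
parsed = parser.from_file("spam.txt")  # the first request starts the server
print("Tika server started")
new_tika_process = get_tika_process()
if new_tika_process:
    print("New process args:", new_tika_process.cmdline())

print(parsed["metadata"])
print(parsed["content"])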

{1} I append to tika_server.TikaJavaArgs directly because the TIKA_JAVA_ARGS environment variable is read only once, when tika_server is imported. Setting the environment variable works too, provided you do it before the import (as in the first attempt in the question).
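
To make the ordering in note {1} concrete, here is a minimal sketch of the environment-variable route. The helper name set_tika_java_args is hypothetical (it is not part of tika-python); it only sets the variable and reports whether tika.tika has already been imported, in which case the new value would be ignored.

```python
import os
import sys


def set_tika_java_args(args: str) -> bool:
    """Set TIKA_JAVA_ARGS and report whether the setting can still take effect.

    Hypothetical helper, not part of tika-python. It returns False when
    tika.tika is already imported, because that module reads the variable
    at import time.
    """
    os.environ["TIKA_JAVA_ARGS"] = args
    return "tika.tika" not in sys.modules
```

Call it before any "from tika import ..." line. Even then, the flag only reaches a server that tika-python starts itself, so a server left running from an earlier session still has to be stopped first.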

Result:

(venv) PS E:\DevProjects\stack-exchange-answers\69637621> python .\main.py
No existing tika process found
2021-10-22 22:50:04,476 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
Tika server started
New process args: ['java', '-cp', 'C:\\Users\\user\\AppData\\Local\\Temp\\tika-server.jar', 'org.apache.tika.server.TikaServerCli', '--port', '9998', '--host', '0.0.0.0']
{'Content-Encoding': 'windows-1252', 'Content-Type': 'text/plain; charset=windows-1252', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'X-TIKA:content_handler': 'ToTextContentHandler', 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '54', 'resourceName': "b'spam.txt'"}
<blank lines removed>
Spam
Spam
More Spam!

(venv) PS E:\DevProjects\stack-exchange-answers\69637621> python .\main.py
Found tika process: psutil.Process(pid=11244, name='java.exe', status='running', started='22:50:04')
Existing process args: ['java', '-cp', 'C:\\Users\\user\\AppData\\Local\\Temp\\tika-server.jar', 'org.apache.tika.server.TikaServerCli', '--port', '9998', '--host', '0.0.0.0']
Terminated tika; exit code 15
2021-10-22 22:54:40,016 [MainThread  ] [WARNI]  Failed to see startup log message; retrying...
Tika server started
New process args: ['java', '-Xmx1G', '-cp', 'C:\\Users\\user\\AppData\\Local\\Temp\\tika-server.jar', 'org.apache.tika.server.TikaServerCli', '--port', '9998', '--host', '0.0.0.0']
{'Content-Encoding': 'windows-1252', 'Content-Type': 'text/plain; charset=windows-1252', 'X-Parsed-By': ['org.apache.tika.parser.DefaultParser', 'org.apache.tika.parser.csv.TextAndCSVParser'], 'X-TIKA:content_handler': 'ToTextContentHandler', 'X-TIKA:embedded_depth': '0', 'X-TIKA:parse_time_millis': '55', 'resourceName': "b'spam.txt'"}
<blank lines removed>
Spam
Spam
More Spam!

(venv) PS E:\DevProjects\stack-exchange-answers\69637621>

You can definitely improve this (for instance, by checking whether the running server already has the arguments you want and skipping the termination if it does), but this should at least get you going again.
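
The argument check mentioned above could look something like the sketch below. The function name has_java_args is made up for illustration; it takes the list returned by psutil's Process.cmdline() and a list of flags you require, and only does exact matching.

```python
from typing import Iterable, Sequence


def has_java_args(cmdline: Sequence[str], wanted: Iterable[str]) -> bool:
    """Return True if every wanted JVM flag is present in cmdline.

    Hypothetical helper. Matching is exact, so "-Xmx1G" and "-Xmx1024m"
    count as different flags even though they request the same heap size.
    """
    return all(flag in cmdline for flag in wanted)
```

With this in place, the script only needs to terminate the process returned by get_tika_process() when has_java_args(process.cmdline(), ["-Xmx1G"]) is False.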

Additionally, you should look into adding a call to tika.tika.killServer() at the end of your script to stop the server when you're done with it.
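
One way to make sure that cleanup runs even when parsing fails is a small context manager. This is a sketch, not something tika-python provides: tika_server_cleanup and its stop parameter are hypothetical names, and the default assumes killServer() takes no arguments and only affects a server started by the current script.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Optional


@contextmanager
def tika_server_cleanup(stop: Optional[Callable[[], None]] = None) -> Iterator[None]:
    """Run the body, then stop the tika server even if the body raised.

    Hypothetical helper. With no argument it falls back to
    tika.tika.killServer, imported lazily so that TIKA_JAVA_ARGS can
    still be set before tika is first imported.
    """
    if stop is None:
        from tika import tika as tika_server
        stop = tika_server.killServer
    try:
        yield
    finally:
        stop()
```

Wrapping the parsing code in "with tika_server_cleanup():" then stops the server when the block exits.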

Can you try starting a tika server on your own (before running tika-python) with the memory settings you need, and then check whether tika-python works against it? My guess is that your options weren't being propagated to the server.
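
A rough sketch of that approach follows. Both function names are hypothetical, the jar path is whatever tika-server jar you downloaded, and the --port option is assumed to be accepted by your tika-server version. TIKA_CLIENT_ONLY and TIKA_SERVER_ENDPOINT are the environment variables tika-python documents for talking to a server it did not start, and they need to be set before tika is imported.

```python
from typing import Dict, List


def build_server_command(jar_path: str, max_heap: str = "1G", port: int = 9998) -> List[str]:
    """Build the argument list for starting a tika server with a larger heap.

    Hypothetical helper; jar_path must point at a tika-server jar.
    """
    return ["java", f"-Xmx{max_heap}", "-jar", jar_path, "--port", str(port)]


def client_only_env(host: str = "localhost", port: int = 9998) -> Dict[str, str]:
    """Environment settings that tell tika-python to use an existing server.

    Apply these with os.environ.update(...) before importing tika.
    """
    return {
        "TIKA_CLIENT_ONLY": "True",
        "TIKA_SERVER_ENDPOINT": f"http://{host}:{port}",
    }
```

The command list can be passed to subprocess.Popen (or typed at the command prompt), after which a script that applies client_only_env() sends its requests to that server instead of launching its own.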
