Apache Tika on python extracts text from pdf on MacBook Pro but not Windows server

As the title says, I am extracting text from multiple documents using tika in Python. For one particular PDF, the text is extracted on my development machine (MacBook Pro) but not on Windows Server 2012, where I get a 'NoneType' error instead.

This is very confusing. At first I suspected the libraries, but both machines use the same jar file from Apache (1.19.1).

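from tika import parser

try:
    # Ask Tika to also extract images embedded in the PDF
    headers = {'X-Tika-PDFextractInlineImages': 'true'}
    data = parser.from_file(pathtofile, serverEndpoint=self.TIKA_SERVER, headers=headers)
    # Keep the first `limit` whitespace-separated tokens (this line raises when data['content'] is None)
    charstoreturn = data['content'].strip().split()[:limit]
    # Rejoin the tokens and normalise quotes and commas
    charstoreturn = ' '.join(charstoreturn).replace("\n", " ").replace('"', "'").replace(",","").replace("’","'")
    return True, charstoreturn
except Exception as err:
    return False, "error {} on file: {}.\n".format(str(err), pathtofile)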

Here TIKA_SERVER is 'http://localhost:1234' and pathtofile is the path of the file I am testing with, the one that fails.

Error on Windows: error 'NoneType' object has no attribute 'strip' on file: \\testdata\\test2.pdf.

Any ideas?

The Python tika wrapper is returning None for the content, so you need to dig into why that happened.
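As a first step, it may help to stop the None from surfacing as an AttributeError and report it explicitly. The sketch below is only an illustration: first_words is a hypothetical helper name, and it assumes the result is a dict with a 'content' key and possibly a 'status' key, which is the shape the question's code relies on.

```python
def first_words(parsed, limit):
    """Return (ok, text_or_message) for a dict shaped like the tika parser result.

    Assumption: `parsed` has a 'content' key (str or None) and may have a
    'status' key holding the HTTP status of the server call.
    """
    content = (parsed or {}).get("content")
    if content is None:
        # Report the missing content instead of calling .strip() on None
        status = (parsed or {}).get("status")
        return False, "no content returned (status: {})".format(status)
    # Keep the first `limit` whitespace-separated tokens, joined by single spaces
    return True, " ".join(content.split()[:limit])
```

In the question's code this would be called as first_words(data, limit) right after parser.from_file, so that a missing 'content' is reported together with whatever status came back.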

Is the tika server running? If not, why not? Do you have a suitable Java VM installed for it to use? Do you have permission to execute the jar? Does the Python code make assumptions about your Windows system that are not true (e.g. that jars are executable, or that the default VM is the correct one)?
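Two of these checks can be scripted using only the standard library. This is a rough sketch, not part of the tika wrapper: java_on_path and server_reachable are hypothetical helper names, and it assumes the server answers a plain GET on its /tika endpoint when it is up.

```python
import shutil
import urllib.error
import urllib.request


def java_on_path():
    """Return True if a 'java' executable can be found on PATH."""
    return shutil.which("java") is not None


def server_reachable(base_url, timeout=5):
    """Return True if a GET on <base_url>/tika answers with HTTP 200."""
    try:
        url = base_url.rstrip("/") + "/tika"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, HTTP error status, and similar
        return False
```

On the Windows server, server_reachable('http://localhost:1234') returning False would point at the server process (or the Java install) rather than at the PDF itself.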

If the tika server is running, does tika itself work properly, or does it give some other error? If you start a tika server from the same jar and put the PDF through it directly, does that work or give you an error? Can you debug the Python library (with a breakpoint, for example) to see what errors, if any, come back from the web request?
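One way to see what the server itself says, bypassing the Python wrapper, is to PUT the file to it directly. The following is a minimal sketch under assumptions: put_file_to_tika is a hypothetical helper, and it assumes the server accepts a PUT of the raw bytes on /tika and returns plain text when asked with an Accept header.

```python
import urllib.error
import urllib.request


def put_file_to_tika(base_url, path, timeout=60):
    """PUT a file to <base_url>/tika and return (http_status_or_None, body_text)."""
    with open(path, "rb") as fh:
        payload = fh.read()
    req = urllib.request.Request(
        base_url.rstrip("/") + "/tika",
        data=payload,
        method="PUT",
        headers={"Accept": "text/plain"},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        # The server answered with an error status; its body often explains why
        return err.code, err.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, OSError) as err:
        # No HTTP answer at all (server down, wrong port, timeout)
        return None, str(err)
```

Running this against the failing PDF on both machines and comparing the status and body should show whether the Windows server rejects the file, returns an empty body, or never answers.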
