
Strange 1 byte character result with pdftotext from .pdf to .txt

I get a weird result when converting a single PDF with no content to a .txt file.

I am using this PHP code in a foreach for all the files found in the dir. It works ridiculously well with the -raw option if there is text available in the pdf.

system("pdftotext -raw " . escapeshellarg($page_name) . " 2>&1");

However, if there is no content, or the file just contains an image, it produces this character in the .txt file:

[Image: Windows screenshot of the line]

(view of Line 1 in the .txt file)

I've tried multiple pdftotext settings, but can't seem to get rid of it.

Is there any way to tackle this with pdftotext?

Some further info: with that character, the file produced is always 1 byte, whereas I'd like it to be listed as 0 bytes in the dir.

(P.S. This is my first time adding an image. Hope it is clear!)

Because of what I just (finally) found, I will close this one with this best answer from @mkl. The part in bold is the answer to this question:

More exactly, that Worksheet PDF does not contain text drawing instructions, merely graphics drawing instructions (the results of which look like text).

pdfminer pdf2text outputs 'FF'

The solution is to check for that weird character (apparently the form feed, 'FF', mentioned in the linked question) when working with files that have this kind of content.
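As a rough illustration of that check, here is a minimal PHP sketch. It assumes the stray byte is a form feed (0x0C), which matches the 'FF' in the linked question and the 1-byte file size, but this has not been verified against the original PDF. The helper name `truncateIfOnlyFormFeed` is hypothetical and not part of the original post.

```php
<?php
// Hypothetical helper (not from the original post): if a pdftotext output
// file contains nothing but form feeds and whitespace, overwrite it with an
// empty string so that it is listed as 0 bytes.
// Returns true when the file is empty after the call, false otherwise.
function truncateIfOnlyFormFeed(string $txtPath): bool
{
    if (!is_file($txtPath)) {
        return false; // no output file to inspect
    }

    $contents = file_get_contents($txtPath);
    if ($contents === false) {
        return false; // unreadable file; leave it alone
    }

    // trim() with an explicit character list: form feed, space, tab, CR, LF.
    if (trim($contents, "\f \t\r\n") === '') {
        file_put_contents($txtPath, '');
        return true;
    }

    return false; // real text was extracted; keep the file as-is
}

// Possible use inside the foreach, assuming pdftotext wrote its output next
// to the PDF with the extension changed to .txt (its default behaviour):
//   $txtPath = preg_replace('/\.pdf$/i', '.txt', $page_name);
//   truncateIfOnlyFormFeed($txtPath);
```

It may also be worth trying pdftotext's `-nopgbrk` option, which is documented as suppressing the form feed characters inserted between pages. Whether that alone yields a 0-byte file for an image-only PDF was not tested here.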
