
Text extracted by PDFBox does not contain international (non-English) characters

I'm using Apache PDFBox to extract text from several PDF files. The files are in Polish and contain Polish characters. Unfortunately, when I print the extracted text, I get ? (question marks) in place of those characters.

Assuming the extracted text is stored in a String s, you are probably printing it like this:

System.out.println(s);

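System.out encodes text with the JVM's default console charset. If that charset cannot represent the Polish characters, each of them is replaced with ?. I suggest printing through a PrintStream that encodes as UTF-8 instead: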

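// Wrap System.out so that characters are encoded as UTF-8 (second argument: autoflush off).
// This constructor throws java.io.UnsupportedEncodingException, which must be caught or declared.
java.io.PrintStream p = new java.io.PrintStream(System.out, false, "UTF-8");
p.println(s);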

This should print the Polish characters instead of ?, provided the console or terminal that displays the output is itself set to UTF-8.
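To see that the ? comes from the output encoding and not from the extracted String itself, here is a small self-contained sketch. It does not use PDFBox at all. The class EncodingDemo and the helper roundTrip are made-up names for illustration, and US-ASCII only stands in for whatever limited charset your console might be using, which is an assumption on my part.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class EncodingDemo {
    // Hypothetical helper: prints s through a PrintStream that uses the named
    // charset, then decodes the bytes with the same charset, i.e. returns the
    // text as a console configured for that charset would show it.
    static String roundTrip(String s, String charsetName) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            PrintStream p = new PrintStream(buffer, true, charsetName);
            p.print(s);
            p.flush();
            return new String(buffer.toByteArray(), Charset.forName(charsetName));
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // A Polish word written with Unicode escapes, so the source file's own
        // encoding does not matter.
        String s = "Za\u017C\u00F3\u0142\u0107";

        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        // Letters that US-ASCII cannot represent are replaced with ?.
        out.println(roundTrip(s, "US-ASCII"));
        // UTF-8 can represent every character, so the text survives unchanged.
        out.println(roundTrip(s, "UTF-8"));
    }
}
```

If the first line shows question marks and the second shows the word correctly, the String is fine and only the console encoding was at fault. If the second line is also garbled, the terminal is most likely not displaying UTF-8.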


 