When I extract Spanish text from a PDF with PDFBox, accented characters are replaced by “strange” characters

Question

I have this Java code that loads a PDF file and extracts all of its text:

File file = new File("C:/file.pdf");
PDDocument doc = PDDocument.load(file);
PDFTextStripper s = new PDFTextStripper();
String content = s.getText(doc);
System.out.println(content);

On Windows the application works correctly and extracts all the text. However, when we deploy it to our Linux server, the Spanish accented characters turn into "strange" characters, for example "carÃ¡cter" (it should be "carácter"). I tried converting the String to bytes and then back to a String as UTF-8:

byte[] b = content.getBytes(Charset.forName("UTF-8"));
String text= new String(b);
System.out.println(text);

But this does not work: on Windows it still works fine, but on the Linux server the Spanish accents are still displayed incorrectly. I would expect that if it works correctly on Windows, it should work on Linux too. Any idea what the cause could be, or what I can do? Thank you.

Answer 1

Ã¡ is what you get when the UTF-8 encoded form of á is misinterpreted as Latin-1.
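
As a small illustration of that misinterpretation, the sketch below encodes á as UTF-8 and then deliberately decodes the bytes as Latin-1 (ISO-8859-1). The class and method names are made up for this example and have nothing to do with PDFBox.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the text as UTF-8, then (wrongly) decode those bytes as Latin-1.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u00E1"); // U+00E1 is a with acute accent
        // Print each resulting char as a hex code unit (ASCII-only output).
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
        // prints U+00C3 and U+00A1, i.e. the two chars shown as "Ã¡"
    }
}
```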

There are two possibilities for this to happen:

a bug in PDFTextStripper.getText() - Java strings are UTF-16 encoded, but getText() may be returning a string in which the UTF-8 bytes have been expanded as-is into 16-bit Java chars, producing 2 chars 0x00C3 0x00A1 instead of 1 char 0x00E1 for á. Calling content.getBytes(UTF8) on such a malformed string would just give you more corrupted data.
To "fix" this kind of mistake, loop through the string copying its chars as-is to a byte[] array, and then decode that array as UTF-8:
```
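import java.nio.charset.StandardCharsets;

// Copy the low 8 bits of each char into a byte array
// (this only makes sense if every char is <= 0xFF), then decode the bytes as UTF-8.
byte[] b = new byte[content.length()];
for (int i = 0; i < content.length(); ++i) {
    b[i] = (byte) content.charAt(i);
}
String text = new String(b, StandardCharsets.UTF_8);
System.out.println(text);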
```
a configuration mismatch - PDFTextStripper.getText() may be returning a properly encoded UTF-16 string containing a single á char as expected, but System.out.println() then outputs the UTF-8 encoded form of that string, and your terminal/console misinterprets the output as Latin-1 instead of UTF-8.
In this case, the code you showed is fine; you just need to double-check your Java environment and terminal/console configuration to make sure they agree on the charset used for console output.
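
For the configuration-mismatch case, a minimal sketch of how one might inspect the JVM's charset settings and force UTF-8 on console output is shown here. It assumes the terminal itself is set to UTF-8; the class and method names are illustrative only.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.Charset;

class ConsoleCharsetCheck {
    // Wrap a stream so that text written to it is always encoded as UTF-8,
    // regardless of the platform default charset.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // What the JVM believes the platform charset is.
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));

        // Write through an explicit UTF-8 stream instead of relying on the default.
        PrintStream out = utf8Stream(System.out);
        out.println("car\u00E1cter");
    }
}
```

If the two reported values differ between the Windows machine and the Linux server, that difference is a likely suspect. Starting the JVM with -Dfile.encoding=UTF-8 is a commonly used way to align them, though whether it is enough depends on the server's locale and terminal settings.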

You need to check the actual char values in content to know which case is actually happening.
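
One way to inspect the actual char values is to print each char as a hexadecimal code unit, which sidesteps the console charset entirely because the output is pure ASCII. This is only a sketch: the helper name dump is hypothetical, and in the real program you would pass it the content string returned by getText().

```java
class CharDump {
    // Render every UTF-16 code unit of the string as U+XXXX, separated by spaces.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A correctly decoded string contains a single U+00E1 for the accented letter.
        System.out.println(dump("car\u00E1cter"));
        // A string corrupted inside the program contains U+00C3 U+00A1 instead.
        System.out.println(dump("car\u00C3\u00A1cter"));
    }
}
```

If the dump of content shows U+00E1, the string is fine and the problem lies in the console output; if it shows U+00C3 U+00A1, the corruption happened before printing.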

solution1
0 2019-03-12 21:15:21
