How can I specify encoding when parsing text with Apache TIKA?

Question

The question is pretty self-explanatory.

The problem I am facing is that every Tika example I have found online uses a StringWriter, as shown below. If I could somehow make this use an OutputStreamWriter instead, I could specify the encoding without any problem. Any help would be appreciated.

InputStream inStream = new FileInputStream(pathname);
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory)SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD,"html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT,"no");
handler.setResult(new StreamResult(sw));
parser.parse(inStream, handler, metadata, context);

Answer 1

You can set the encoding through the Metadata object. I've used this snippet:

import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

// DATAFILE_CHARSET is a String constant holding the charset name, e.g. "UTF-8"
Tika tika = new Tika();
Metadata metadata = new Metadata();
metadata.add(Metadata.CONTENT_ENCODING, DATAFILE_CHARSET);
String parsedString = tika.parseToString(inputStream, metadata);

By default, Tika tries to detect the encoding itself when parsing HTML, but that detection is sometimes wrong, so passing the encoding explicitly can avoid garbled output.
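For the handler-based code in the question, one possible approach is to hand the StreamResult an OutputStreamWriter instead of a StringWriter, since StreamResult accepts any Writer. The sketch below uses only JDK classes; the class name EncodedHandlerOutput and its helper methods are made up for illustration, and the SAX events are fed by hand where the question's code would call parser.parse(inStream, handler, metadata, context). Setting OutputKeys.ENCODING as well is an assumption about what keeps the serializer consistent with the writer, not something the original answers confirm.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper class, not part of Tika.
class EncodedHandlerOutput {

    // Build a handler like the one in the question, but writing to any Writer.
    static TransformerHandler newHandler(Writer out, Charset charset)
            throws TransformerConfigurationException {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        // Assumption: tell the serializer which charset the writer will use.
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, charset.name());
        handler.setResult(new StreamResult(out));
        return handler;
    }

    // Serialize a single <p> element and return the encoded bytes.
    // In real use, parser.parse(...) would produce the SAX events instead.
    static byte[] render(String text, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            TransformerHandler handler = newHandler(writer, charset);
            char[] chars = text.toCharArray();
            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endDocument();
        } catch (TransformerConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not serialize output", e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] out = render("hello", StandardCharsets.UTF_8);
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```

With this arrangement, the charset given to the OutputStreamWriter controls how the output is encoded. The charset used to decode the input document is still a separate matter, handled by Tika's detection or by the metadata hint shown above.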

Answer 2

If you are parsing plain text, you can specify the encoding using conventional Java I/O.
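A minimal sketch of what that could look like, assuming the input really is plain text and its charset is known in advance. The class name PlainTextDecoding and the method readAll are hypothetical names chosen for this example; only JDK classes are used and Tika is not involved.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for illustration.
class PlainTextDecoding {

    // Read the whole stream, decoding bytes with the explicitly given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // "caf\u00e9" encoded as ISO-8859-1 stands in for a file on disk.
        byte[] bytes = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        String decoded = readAll(new ByteArrayInputStream(bytes),
                StandardCharsets.ISO_8859_1);
        System.out.println(decoded);
    }
}
```

For a real file, a FileInputStream for the pathname from the question could be passed in place of the ByteArrayInputStream.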

solution1
2 2013-11-27 18:59:41

solution2
-2 2013-09-19 06:25:49
