
How can I specify encoding when parsing text with Apache TIKA?

The question is pretty self-explanatory.

The problem I am facing is that every Tika example I have found online uses a StringWriter, as shown below. If I could somehow make this use an OutputStreamWriter, I could specify the encoding without any trouble. Any help would be appreciated.

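// parser, metadata and context were not declared in the original snippet;
// the three declarations below are an assumed, typical setup.
Parser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();

InputStream inStream = new FileInputStream(pathname);
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
handler.setResult(new StreamResult(sw)); // a StringWriter collects characters, so no byte encoding is chosen here
parser.parse(inStream, handler, metadata, context);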

You can set the encoding through the Metadata object. I have used this snippet:

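import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;

// DATAFILE_CHARSET is presumably a charset name defined elsewhere, e.g. "UTF-8"
Tika tika = new Tika();
Metadata metadata = new Metadata();
metadata.add(Metadata.CONTENT_ENCODING, DATAFILE_CHARSET); // hint the input encoding to the parser
String parsedString = tika.parseToString(inputStream, metadata);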

By default, Tika tries to determine the encoding itself when parsing HTML, but this detection can sometimes lead to errors.
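For the handler-based code in the question, the output side can be sketched with the JDK alone: StreamResult accepts any Writer, so an OutputStreamWriter with an explicit charset can stand in for the StringWriter. The class and method names below are hypothetical, the hand-written SAX events are only a stand-in for parser.parse(inStream, handler, metadata, context), and setting OutputKeys.ENCODING to match the writer is my assumption of good practice rather than something the answer above states.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical sketch: same handler setup as the question, but writing bytes in a chosen charset.
class EncodedHandlerSketch {

    // Serializes a single <p> element containing `text`, encoded with `charset`.
    static byte[] serialize(String text, Charset charset) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // The OutputStreamWriter is where the byte encoding is actually applied.
        try (Writer writer = new OutputStreamWriter(bytes, charset)) {
            SAXTransformerFactory factory =
                    (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler handler = factory.newTransformerHandler();
            handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
            handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
            // Assumption: keep the serializer's declared encoding in step with the writer.
            handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, charset.name());
            handler.setResult(new StreamResult(writer));

            // Stand-in for parser.parse(...): emit a few SAX events by hand.
            char[] chars = text.toCharArray();
            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            handler.characters(chars, 0, chars.length);
            handler.endElement("", "p", "p");
            handler.endDocument();
        } // closing the writer flushes the encoded bytes into the buffer
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] out = serialize("abc", StandardCharsets.UTF_16);
        // Decode with the same charset that was used for writing.
        System.out.println(new String(out, StandardCharsets.UTF_16));
    }
}
```

With Tika, the same idea would mean passing the handler to parser.parse in place of the hand-written events; the ByteArrayOutputStream could equally be a FileOutputStream.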

If you are parsing plain text, you can specify the encoding using traditional I/O.
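As a minimal sketch of what "traditional I/O" could look like here, the JDK's InputStreamReader takes an explicit Charset, so plain text can be decoded without relying on any detection. The class and method names are hypothetical and the sample bytes are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: decode a text stream with a caller-supplied charset.
class ExplicitCharsetRead {

    // Reads the whole stream as text, decoding bytes with `charset`.
    static String readAll(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Illustrative input: "café" encoded as ISO-8859-1 (one byte per character).
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        String text = readAll(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(text);
    }
}
```

Decoding the same bytes with the wrong charset would not fail loudly; it would silently yield replacement characters, which is why passing the charset explicitly is useful when it is known.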
