The question is pretty self-explanatory.
The problem I am facing is that every Tika example I have found online uses a StringWriter, as shown below. If I could somehow make this use an OutputStreamWriter instead, I could specify the encoding without any trouble. Any help would be appreciated.
// parser, metadata, and context are created elsewhere (not shown)
InputStream inStream = new FileInputStream(pathname);
StringWriter sw = new StringWriter();
SAXTransformerFactory factory = (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
// The serialized HTML ends up in the StringWriter
handler.setResult(new StreamResult(sw));
parser.parse(inStream, handler, metadata, context);
You can set the encoding via the Metadata object. I've used this snippet:
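import org.apache.tika.Tika;
import org.apache.tika.metadata.Metadata;
// DATAFILE_CHARSET is the charset name of the input, e.g. "UTF-8"
Tika tika = new Tika();
Metadata metadata = new Metadata();
metadata.add(Metadata.CONTENT_ENCODING, DATAFILE_CHARSET);
String parsedString = tika.parseToString(inputStream, metadata);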
By default, Tika tries to detect the encoding itself when parsing HTML, but that detection sometimes picks the wrong one, so passing the encoding explicitly can avoid garbled output.
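If you want to keep the handler-based code from the question, one option might be to hand the StreamResult an OutputStreamWriter and set OutputKeys.ENCODING to the same charset. The following is only a sketch that uses JDK classes: the class name EncodedHandlerSketch and the helpers newHtmlHandler and demo are made up for illustration, and the hand-written SAX events stand in for the parser.parse(inStream, handler, metadata, context) call, which is assumed to feed the handler in the same way.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

class EncodedHandlerSketch {

    // Hypothetical helper: builds an HTML-serializing handler that writes to 'out'.
    static TransformerHandler newHtmlHandler(Writer out, Charset charset) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
        handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "no");
        // Tell the serializer which encoding the writer will use.
        handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, charset.name());
        // Set the result last, after all output properties.
        handler.setResult(new StreamResult(out));
        return handler;
    }

    // Hypothetical demo: serializes a tiny document into bytes in the given charset.
    static byte[] demo(Charset charset) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = new OutputStreamWriter(bytes, charset)) {
            TransformerHandler handler = newHtmlHandler(out, charset);
            // Stand-in for parser.parse(inStream, handler, metadata, context).
            handler.startDocument();
            handler.startElement("", "p", "p", new AttributesImpl());
            char[] text = "hello".toCharArray();
            handler.characters(text, 0, text.length);
            handler.endElement("", "p", "p");
            handler.endDocument();
        } // closing the writer flushes the encoded bytes
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        byte[] utf8 = demo(StandardCharsets.UTF_8);
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

In real code the ByteArrayOutputStream would likely be a FileOutputStream or a response stream. The writer and the ENCODING property are kept in sync here so that the serializer and the byte encoder agree on the charset.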
If you are parsing plain text, you can specify the encoding with traditional Java I/O.
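As a rough illustration of the plain I/O route, the sketch below decodes a stream through an InputStreamReader with an explicit Charset. The class name CharsetReadSketch and the helper readText are hypothetical, and the in-memory byte array stands in for a real file or network stream.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetReadSketch {

    // Hypothetical helper: reads the whole stream, decoding it with the given charset.
    static String readText(InputStream in, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new BufferedReader(new InputStreamReader(in, charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "caf\u00e9" encoded as ISO-8859-1; decoding must use the same charset.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        String text = readText(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1);
        System.out.println(text.equals("caf\u00e9")); // prints "true"
    }
}
```

Decoding the same bytes with a mismatched charset such as UTF-8 would not reproduce the original string, which is why the charset is passed explicitly rather than left to the platform default.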