
Extract contents of a file using ContentHandler

I'm trying to extract the contents of a .txt file using ContentHandler. My code is below, and the file contains:

Sample content Sample contentSample contentSample contentSample contentSample contentSample contentSample contentSample contentSample contentSample contentSample contentSample

The code below does not print the extracted contents. What am I missing?

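import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;

import org.apache.tika.detect.DefaultDetector;
import org.apache.tika.detect.Detector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

class Test {
    private final OutputStream outputstream;
    private final ParseContext context;
    private final Detector detector;
    private final Parser parser;
    private final Metadata metadata;
    private String extractedText;

    public Test() {
        context = new ParseContext();
        detector = new DefaultDetector();
        parser = new AutoDetectParser(detector);
        // Register the parser so embedded documents are parsed with it too.
        context.set(Parser.class, parser);
        outputstream = new ByteArrayOutputStream();
        metadata = new Metadata();
    }

    public void process(String filename) throws Exception {
        // Treat the argument as a local file if one exists, otherwise as a URL.
        URL url;
        File file = new File(filename);
        if (file.isFile()) {
            url = file.toURI().toURL();
        } else {
            url = new URL(filename);
        }
        // try-with-resources closes the stream even if parsing throws.
        try (InputStream input = TikaInputStream.get(url, metadata)) {
            // The handler writes the body text of the document to outputstream.
            ContentHandler handler = new BodyContentHandler(outputstream);
            parser.parse(input, handler, metadata, context);
        }
    }

    public void getString() {
        // Get the text into a String object.
        extractedText = outputstream.toString();
        // Do whatever you want with this String object.
        System.out.println("extracted text " + extractedText);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 1) {
            Test textExtractor = new Test();
            // Pass the path on the command line, e.g. D:\docs\sample.txt
            textExtractor.process(args[0]);
            textExtractor.getString();
        } else {
            throw new IllegalArgumentException("Usage: Test <file path or URL>");
        }
    }
}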

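In addition to the Apache Tika tika-core dependency, add the tika-parsers dependency. tika-core contains only the framework classes (AutoDetectParser, the detectors, the content handlers), while the format-specific parsers ship in tika-parsers. With only tika-core on the classpath, AutoDetectParser finds no parser for the file, so the handler receives no text and the output stays empty.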
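
A quick way to confirm that a missing dependency is the cause is to check whether a parser class is visible at runtime. The following is only a minimal sketch using the JDK alone: ClasspathCheck and isOnClasspath are hypothetical names made up for illustration, and the class name org.apache.tika.parser.txt.TXTParser is an assumption about where the plain-text parser lives, so verify it against the Tika version you actually use.

```java
// Hypothetical helper (not part of Tika): reports whether a class can be
// loaded from the current classpath, without initializing it.
class ClasspathCheck {

    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate and load the class, do not run static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Framework class, expected to come from tika-core.
        String core = "org.apache.tika.parser.AutoDetectParser";
        // Assumed name of the plain-text parser, expected to come from tika-parsers.
        String txt = "org.apache.tika.parser.txt.TXTParser";

        System.out.println(core + " present: " + isOnClasspath(core));
        System.out.println(txt + " present: " + isOnClasspath(txt));
    }
}
```

Run it with the same classpath as the Test class. If the first class is reported as present and the second as missing, only tika-core is being picked up, which matches the empty output described in the question.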
