
Can't read the same InputStream twice

This is my code:

// getFile() method returns the input stream of a local or online file
InputStream fileStream = getFile(source);
// Convert an InputStream to an InputSource
org.xml.sax.InputSource fileSource = new org.xml.sax.InputSource(fileStream);
// Extract text via the Boilerpipe DefaultExtractor
String text = DefaultExtractor.INSTANCE.getText(fileSource);

// Extract text and metadata via Apache Tika
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(fileStream, handler, metadata, context);

I can't figure out why only the first extractor works: Boilerpipe extracts the text, while Apache Tika, which runs second, is not able to extract anything.

I tried to create a copy of fileStream (via InputStream fileStream2 = fileStream;) and to pass fileStream to one reader and fileStream2 to the other, but that didn't work either.

I also tried passing the HTML extracted from fileStream to Boilerpipe and fileStream itself to Tika, but the result was the same.

I suspect that the problem is that the same InputStream cannot be read twice.
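That suspicion can be checked in isolation, without Boilerpipe or Tika. The following is only a minimal sketch: the class name StreamReadTwiceDemo and the helper drain() are made up for illustration, and an in-memory ByteArrayInputStream stands in for whatever getFile() returns. It also shows why the fileStream2 = fileStream attempt changes nothing: the assignment copies the reference, so both variables share one read position.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original code.
final class StreamReadTwiceDemo {

    // Reads the stream to its end and returns the number of bytes consumed.
    static int drain(InputStream in) {
        try {
            byte[] buffer = new byte[1024];
            int total = 0;
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "<html><body>hello</body></html>".getBytes(StandardCharsets.UTF_8);
        InputStream fileStream = new ByteArrayInputStream(data);
        InputStream fileStream2 = fileStream; // same object, not a copy

        // The first reader consumes every byte (data.length).
        System.out.println("first read:  " + drain(fileStream));
        // The second reader starts at the end of the stream and gets 0 bytes.
        System.out.println("second read: " + drain(fileStream2));
    }
}
```

Whichever extractor runs second is in the position of the second drain() call: the stream is already at its end, so there is nothing left to parse.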

How can I pass the content of one InputStream to two readers?

EDIT: I found the solution and posted it below.

If you have a Maven project, you need to include these dependencies (in your pom.xml) for Boilerpipe to work:

<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>x.y.z</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
    <version>x.y.z</version>
</dependency>

I found out that an InputStream can't be read twice, which is what Tika and Boilerpipe were doing in my old code. So I now read fileStream into a String, pass that String to Boilerpipe, then convert the String to a ByteArrayInputStream and pass that to Tika. This is my new code:

// getFile() method returns the input stream of a local or online file
InputStream fileStream = getFile(source);

// Read the value of the InputStream and pass it to the
// Boilerpipe DefaultExtractor in order to extract the text
String html = readFromStream(fileStream);
String text = DefaultExtractor.INSTANCE.getText(html);

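// Wrap the content read from fileStream in a new ByteArrayInputStream for the second reader.
// java.nio.charset.StandardCharsets.UTF_8 avoids the checked UnsupportedEncodingException
// that getBytes("UTF-8") declares.
fileStream = new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));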

// Extract text and metadata via Apache Tika
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(fileStream, handler, metadata, context);
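The readFromStream() helper is not shown above. Below is one possible way to get the same effect, as a sketch under my own assumptions rather than the code I actually used: the class BufferedSource and its methods from(), newStream() and asUtf8String() are hypothetical names. It keeps the raw bytes instead of a String, so non-text inputs (a PDF, for example) reach Tika byte for byte, and only the text view assumes UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: buffers a stream once so several readers can each get their own copy.
final class BufferedSource {
    private final byte[] bytes;

    private BufferedSource(byte[] bytes) {
        this.bytes = bytes;
    }

    // Reads the stream fully into memory. The caller remains responsible for closing it.
    static BufferedSource from(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new BufferedSource(out.toByteArray());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns a fresh, independent stream positioned at the start of the buffered content.
    InputStream newStream() {
        return new ByteArrayInputStream(bytes);
    }

    // Text view of the content. Assumption: the source is UTF-8 encoded text.
    String asUtf8String() {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

With this in place, the code above would become BufferedSource buffered = BufferedSource.from(getFile(source)); then DefaultExtractor.INSTANCE.getText(buffered.asUtf8String()) for Boilerpipe and parser.parse(buffered.newStream(), handler, metadata, context) for Tika. The trade-off is that the whole file is held in memory, which is fine for web pages but may not suit very large files.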

