
Can't read the same InputStream twice

This is my code:

// getFile() method returns the input stream of a local or online file
InputStream fileStream = getFile(source);
// Convert an InputStream to an InputSource
org.xml.sax.InputSource fileSource = new org.xml.sax.InputSource(fileStream);
// Extract text via the Boilerpipe DefaultExtractor
String text = DefaultExtractor.INSTANCE.getText(fileSource);

// Extract text and metadata via Apache Tika
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(fileStream, handler, metadata, context);

I can't figure out why only the first extractor works.

In this case only Boilerpipe (the first extractor) works, while Apache Tika (the second extractor) extracts nothing.

I tried to create a copy of fileStream (via InputStream fileStream2 = fileStream;) and to pass fileStream to one reader and fileStream2 to the other, but that didn't work either.
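A minimal sketch of why that assignment does not help, using only the JDK: the assignment copies the reference, not the stream, so both variables share one read position. The class name AliasDemo and the helper drain are hypothetical, and a ByteArrayInputStream stands in for whatever getFile() returns.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

class AliasDemo {
    // Reads every remaining byte from the stream and returns how many were read.
    static int drain(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for getFile(source); the real stream may come from a file or URL.
        InputStream fileStream =
                new ByteArrayInputStream("<html>hi</html>".getBytes(StandardCharsets.UTF_8));
        InputStream fileStream2 = fileStream; // same object, not a copy

        System.out.println(fileStream == fileStream2); // prints true
        System.out.println(drain(fileStream));         // prints 15
        System.out.println(drain(fileStream2));        // prints 0: already consumed
    }
}
```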

I also tried passing the HTML extracted from fileStream to Boilerpipe and fileStream itself to Tika, but the result was the same.

I suspect the problem is that the same InputStream cannot be read twice.

Could you please help me pass the content of one InputStream to two readers?

EDIT: I found the solution and posted it below.

If you have a Maven project, you have to include these dependencies in your pom.xml for Boilerpipe to work:

<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>x.y.z</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
    <version>x.y.z</version>
</dependency>

I found out that an InputStream can't be read twice, which is what Tika and Boilerpipe were doing in my old code. So I now read fileStream once and convert it to a String, pass that String to Boilerpipe, then convert the String to a ByteArrayInputStream and pass that to Tika. This is my new code:

// getFile() method returns the input stream of a local or online file
InputStream fileStream = getFile(source);

// Read the value of the InputStream and pass it to the
// Boilerpipe DefaultExtractor in order to extract the text
String html = readFromStream(fileStream);
String text = DefaultExtractor.INSTANCE.getText(html);

// Convert the value read from fileStream to a new ByteArrayInputStream
fileStream = new ByteArrayInputStream(html.getBytes("UTF-8"));

// Extract text and metadata via Apache Tika
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(fileStream, handler, metadata, context);
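The readFromStream helper is not shown above. As a possible variation, the stream can be buffered as raw bytes instead of a String, which avoids assuming the content is UTF-8 text (a String round trip is not guaranteed to preserve binary content such as a PDF). The following is a minimal sketch using only the JDK; the class name StreamBuffer and the method readAll are hypothetical, and the Boilerpipe and Tika calls are left as comments so the example runs without those libraries. Note that it holds the whole file in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

class StreamBuffer {
    // Copies everything left in the stream into a byte array.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for getFile(source) from the code above.
        InputStream fileStream = new ByteArrayInputStream(new byte[] {1, 2, 3, 4});

        // Read the original stream exactly once.
        byte[] content = readAll(fileStream);

        // Each consumer gets its own fresh stream over the same bytes.
        InputStream forBoilerpipe = new ByteArrayInputStream(content);
        InputStream forTika = new ByteArrayInputStream(content);

        // In the real code, forBoilerpipe would be wrapped in an InputSource
        // and forTika passed to parser.parse(...); here we only show that
        // both streams deliver the full content.
        System.out.println(readAll(forBoilerpipe).length);
        System.out.println(readAll(forTika).length);
    }
}
```

Both printed lengths match the size of the original content, because each ByteArrayInputStream starts at position zero.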
