Can't read the same InputStream twice
This is my code:
// getFile() method returns the input stream of a local or online file
InputStream fileStream = getFile(source);
// Convert an InputStream to an InputSource
org.xml.sax.InputSource fileSource = new org.xml.sax.InputSource(fileStream);
// Extract text via the Boilerpipe DefaultExtractor
String text = DefaultExtractor.INSTANCE.getText(fileSource);
// Extract text and metadata via Apache Tika
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(fileStream, handler, metadata, context);
I can't figure out why only the first extractor works.
In this case only Boilerpipe (the first extractor) works, while Apache Tika (the second extractor) is not able to extract anything.
I tried to create a copy of fileStream (via InputStream fileStream2 = fileStream;) and to pass fileStream to one reader and fileStream2 to another reader, but it didn't work either.
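A likely reason the assignment does not help is that InputStream fileStream2 = fileStream; copies only the reference, so both variables point at the same stream object and share its read position. The following is a minimal sketch of that behavior using only the JDK; the AliasDemo class, the drain helper, and the sample HTML are made up for illustration and are not part of the original code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class AliasDemo {
    // Hypothetical helper: reads until end of stream and returns the number of bytes consumed.
    public static int drain(InputStream in) throws IOException {
        int count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for getFile(source); an in-memory stream is assumed here.
        InputStream fileStream =
                new ByteArrayInputStream("<html>hello</html>".getBytes(StandardCharsets.UTF_8));
        InputStream fileStream2 = fileStream; // same object, not a copy

        System.out.println(fileStream == fileStream2); // true
        System.out.println(drain(fileStream));         // 18: the first reader consumes everything
        System.out.println(drain(fileStream2));        // 0: nothing is left for the second reader
    }
}
```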
I also tried passing to Boilerpipe the HTML extracted from fileStream, and fileStream to Tika, but the result was the same.
I suspect that the problem is that the same InputStream cannot be read twice.
Could you please help me pass the content of one InputStream to two readers?
EDIT: I found the solution and posted it below.
If you have a Maven project, you have to include these dependencies (in your pom.xml) for Boilerpipe to work:
<dependency>
    <groupId>xerces</groupId>
    <artifactId>xercesImpl</artifactId>
    <version>x.y.z</version>
</dependency>
<dependency>
    <groupId>net.sourceforge.nekohtml</groupId>
    <artifactId>nekohtml</artifactId>
    <version>x.y.z</version>
</dependency>
I found out that an InputStream can't be read twice, which is what Tika and Boilerpipe were doing in my old code. So I figured out that I could read fileStream and convert it to a String, pass that to Boilerpipe, then convert the String to a ByteArrayInputStream and pass that to Tika. This is my new code.
// getFile() method returns the input stream of a local or online file
InputStream fileStream = getFile(source);
// Read the value of the InputStream and pass it to the
// Boilerpipe DefaultExtractor in order to extract the text
String html = readFromStream(fileStream);
String text = DefaultExtractor.INSTANCE.getText(html);
// Convert the value read from fileStream to a new ByteArrayInputStream
fileStream = new ByteArrayInputStream(html.getBytes("UTF-8"));
// Extract text and metadata via Apache Tika
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
AutoDetectParser parser = new AutoDetectParser();
parser.parse(fileStream, handler, metadata, context);
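The readFromStream() helper is not shown above, so here is a rough sketch of what it might look like, together with one possible variation on the same idea: buffer the raw bytes once and hand each reader its own ByteArrayInputStream. Keeping the bytes untouched avoids the String round trip, which can alter content that is not valid UTF-8 text. The StreamBuffering class, the toByteArray helper, and the UTF-8 assumption in readFromStream are illustrative guesses, not the original implementation, and the whole file is held in memory.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class StreamBuffering {
    // Hypothetical helper: drains the stream into a byte array so it can be replayed.
    public static byte[] toByteArray(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    // Possible shape of readFromStream(); decoding as UTF-8 is an assumption.
    public static String readFromStream(InputStream in) throws IOException {
        return new String(toByteArray(in), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for getFile(source).
        InputStream fileStream =
                new ByteArrayInputStream("<p>hi</p>".getBytes(StandardCharsets.UTF_8));

        // Read the original stream exactly once.
        byte[] content = toByteArray(fileStream);

        // Each reader gets an independent stream over the same bytes.
        InputStream forFirstReader = new ByteArrayInputStream(content);
        InputStream forSecondReader = new ByteArrayInputStream(content);

        System.out.println(readFromStream(forFirstReader));
        System.out.println(readFromStream(forSecondReader));
    }
}
```

In the code above, forFirstReader could be wrapped in an InputSource for Boilerpipe and forSecondReader passed to parser.parse(...), so neither library depends on what the other has already consumed.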