Tika - Out of Memory Exception

I have been using Tika to extract only the text content from various files. I ran into a peculiar issue while parsing a .doc file with embedded images: the image fetcher was invoked and threw java.lang.OutOfMemoryError: Java heap space.

I tried the same file in the tika-app 1.22 GUI and got the following exceptions:

Exception in thread "Image Fetcher 2" java.lang.OutOfMemoryError: Java heap space
    at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:75)
    at java.awt.image.Raster.createPackedRaster(Raster.java:467)
    at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:1032)
    at sun.awt.image.ImageRepresentation.createBufferedImage(ImageRepresentation.java:253)
    at sun.awt.image.ImageRepresentation.setPixels(ImageRepresentation.java:559)
    at sun.awt.image.ImageDecoder.setPixels(ImageDecoder.java:138)
    at sun.awt.image.PNGImageDecoder.sendPixels(PNGImageDecoder.java:549)
    at sun.awt.image.PNGImageDecoder.produceImage(PNGImageDecoder.java:470)
    at sun.awt.image.InputStreamImageSource.doFetch(InputStreamImageSource.java:269)
    at sun.awt.image.ImageFetcher.fetchloop(ImageFetcher.java:205)
    at sun.awt.image.ImageFetcher.run(ImageFetcher.java:169)
Exception in thread "Image Fetcher 0" java.lang.OutOfMemoryError: Java heap space
    at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:75)
    at java.awt.image.Raster.createPackedRaster(Raster.java:467)
    at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:1032)
    at sun.awt.image.ImageRepresentation.createBufferedImage(ImageRepresentation.java:253)
    at sun.awt.image.ImageRepresentation.setPixels(ImageRepresentation.java:559)
    at sun.awt.image.ImageDecoder.setPixels(ImageDecoder.java:138)
    at sun.awt.image.PNGImageDecoder.sendPixels(PNGImageDecoder.java:549)
    at sun.awt.image.PNGImageDecoder.produceImage(PNGImageDecoder.java:470)
    at sun.awt.image.InputStreamImageSource.doFetch(InputStreamImageSource.java:269)
    at sun.awt.image.ImageFetcher.fetchloop(ImageFetcher.java:205)
    at sun.awt.image.ImageFetcher.run(ImageFetcher.java:169)
Exception in thread "Image Fetcher 1" java.lang.OutOfMemoryError: Java heap space
    at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:75)
    at java.awt.image.Raster.createPackedRaster(Raster.java:467)
    at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:1032)
    at sun.awt.image.ImageRepresentation.createBufferedImage(ImageRepresentation.java:253)
    at sun.awt.image.ImageRepresentation.setPixels(ImageRepresentation.java:559)
    at sun.awt.image.ImageDecoder.setPixels(ImageDecoder.java:138)
    at sun.awt.image.PNGImageDecoder.sendPixels(PNGImageDecoder.java:549)
    at sun.awt.image.PNGImageDecoder.produceImage(PNGImageDecoder.java:470)
    at sun.awt.image.InputStreamImageSource.doFetch(InputStreamImageSource.java:269)
    at sun.awt.image.ImageFetcher.fetchloop(ImageFetcher.java:205)
    at sun.awt.image.ImageFetcher.run(ImageFetcher.java:169)
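
The top frame of each trace, DataBufferInt.<init>, allocates one int per pixel, so the heap needed to decode an image depends on its pixel dimensions rather than on the size of the PNG on disk. A rough sketch of that arithmetic follows; the class name and the 10,000 x 10,000 dimensions are hypothetical, since the question does not state how large the embedded images are.

```java
/** Rough estimate of the heap needed for a packed-int raster (one int per pixel). */
final class ImageHeapEstimate {

    /** Bytes required for a width x height raster backed by an int[] (4 bytes per pixel). */
    static long bytesForIntRaster(int width, int height) {
        // Widen to long before multiplying so large images do not overflow int.
        return (long) width * height * Integer.BYTES;
    }

    /** True if the raster alone would be smaller than the JVM's maximum heap. */
    static boolean fitsInHeap(int width, int height) {
        return bytesForIntRaster(width, height) < Runtime.getRuntime().maxMemory();
    }

    public static void main(String[] args) {
        // Hypothetical dimensions, chosen only to illustrate the scale.
        int width = 10_000;
        int height = 10_000;
        System.out.println(bytesForIntRaster(width, height)); // prints 400000000
        System.out.println(fitsInHeap(width, height));
    }
}
```

Three fetcher threads decoding in parallel, as in the traces above, would each need a buffer of this kind at the same time.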

My questions are:

  1. Why does Tika need to fetch images when I only want to extract text from documents?
  2. How do I configure Tika to skip fetching images in a case like this? I don't want to increase the heap size to solve this; I would rather skip the images gracefully.

Edit: I read the file as a stream, wrap it in a TikaInputStream, and open an OutputStream to write the result to another file.

        // Write the extracted text to the output file, stopping after writeLimit characters.
        outputWriter = new OutputStreamWriter(outputStream);
        WriteOutContentHandler writeOutContentHandler = new WriteOutContentHandler(outputWriter, writeLimit);

        AutoDetectParser parser = new AutoDetectParser();
        Metadata metadata = new Metadata();

        // inputStream is the TikaInputStream wrapping the source file.
        parser.parse(inputStream, writeOutContentHandler, metadata);

I have attached the file I used for testing. Parsing it produced the following exception:

org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4f209819
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
    at tika.Tikaimpl.main(Tikaimpl.java:49)
Caused by: java.lang.IndexOutOfBoundsException: Block 96991 not found
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.getBlockAt(POIFSFileSystem.java:434)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.readBAT(POIFSFileSystem.java:406)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.readCoreContents(POIFSFileSystem.java:359)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:239)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:172)
    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:121)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
    ... 3 more
Caused by: java.lang.IndexOutOfBoundsException: Position 49659904 past the end of the file
    at org.apache.poi.poifs.nio.FileBackedDataSource.read(FileBackedDataSource.java:88)
    at org.apache.poi.poifs.filesystem.POIFSFileSystem.getBlockAt(POIFSFileSystem.java:432)
    ... 9 more
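
The two messages in this trace appear to describe the same failed read. Assuming the common 512-byte OLE2 sector size and a one-sector header (both assumptions; the question does not state the sector size of the file), block 96991 starts at byte 49659904, which is exactly the position reported as past the end of the file. That would suggest the file is shorter than its allocation table claims, for example because it was truncated or not read completely. A minimal sketch of the arithmetic, with a hypothetical class name:

```java
/** Maps an OLE2 block index to its byte offset, assuming a one-sector header. */
final class Ole2BlockOffset {

    /** Assumed sector size; OLE2 files can also use 4096-byte sectors. */
    static final int SECTOR_SIZE = 512;

    /** Byte offset at which the given block starts (block 0 follows the header sector). */
    static long offsetOfBlock(int blockIndex, int sectorSize) {
        return ((long) blockIndex + 1) * sectorSize;
    }

    /** True if the whole block lies inside a file of the given length. */
    static boolean blockExists(int blockIndex, int sectorSize, long fileLength) {
        return offsetOfBlock(blockIndex, sectorSize) + sectorSize <= fileLength;
    }

    public static void main(String[] args) {
        System.out.println(offsetOfBlock(96991, SECTOR_SIZE)); // prints 49659904
    }
}
```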

File used for testing.

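Increase -XX:MaxPermSize up to 1 GB. Note that this flag sizes the permanent generation, which was removed in Java 8; a "Java heap space" error is governed by the maximum heap size (-Xmx).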

