Apache dependency bug? org.apache.parquet.hadoop.codec.SnappyCodec was not found Error in apache library

I'm currently trying to read a Parquet file in Java without using Spark. Here's what I have so far, based on Adam Melnyk's blog post on the subject.


        ParquetFileReader reader = ParquetFileReader.open(file);
        MessageType schema = reader.getFooter().getFileMetaData().getSchema();
        List<Type> fields = schema.getFields();
        PageReadStore pages;
        // The next line is line 167 of my class, where the exception is thrown.
        while ((pages = reader.readNextRowGroup()) != null) {
            long rows = pages.getRowCount();
            LOG.info("Number of rows: " + rows);
            MessageColumnIO columnIO = new ColumnIOFactory().getColumnIO(schema);
            RecordReader recordReader = columnIO.getRecordReader(pages, new GroupRecordConverter(schema));

            for (int i = 0; i < rows; i++) {
                SimpleGroup simpleGroup = (SimpleGroup) recordReader.read();
                // ... (rest of the loop body omitted)
            }
        }

(Note that the comment marks line 167 of my code, which is where the error is thrown.)

Error Message

org.apache.parquet.hadoop.BadConfigurationException: Class org.apache.parquet.hadoop.codec.SnappyCodec was not found
        at org.apache.parquet.hadoop.CodecFactory.getCodec(CodecFactory.java:243)
        at org.apache.parquet.hadoop.CodecFactory$HeapBytesDecompressor.<init>(CodecFactory.java:96)
        at org.apache.parquet.hadoop.CodecFactory.createDecompressor(CodecFactory.java:212)
        at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:201)
        at org.apache.parquet.hadoop.CodecFactory.getDecompressor(CodecFactory.java:42)
        at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1519)
        at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:1402)
        at org.apache.parquet.hadoop.ParquetFileReader.readChunkPages(ParquetFileReader.java:1023)
        at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:928)
        at [myClassPath]([myClass].java:167)



It seems that CodecFactory cannot find the SnappyCodec class, but I checked my referenced libraries and the class is there (screenshot: referenced_libraries).

CodecFactory should be able to recognize the SnappyCodec class. Any recommendations? Thanks

Found a solution.

The problem was that the SnappyCodec class was being removed from the packaged jar by the maven-shade-plugin I have configured for my application.

I realized this after packaging the jar with Maven, opening that jar with WinZip, and checking the codec directory of the packaged jar, where I found that SnappyCodec.class no longer existed.
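If you would rather not open the jar by hand, the same check can be done in a few lines of Java. This is only a minimal sketch: the class name `JarClassCheck` and the method `containsClass` are hypothetical names made up for illustration, and it assumes the class sits at its normal path inside the jar (that is, it has not been relocated to another package).

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.jar.JarFile;

/** Hypothetical helper: checks whether a packaged jar still contains a given class file. */
final class JarClassCheck {

    /** Returns true if the jar has an entry for the fully qualified class name. */
    static boolean containsClass(Path jarPath, String className) throws IOException {
        // A class a.b.C is stored in a jar as the entry a/b/C.class.
        String entryName = className.replace('.', '/') + ".class";
        try (JarFile jar = new JarFile(jarPath.toFile())) {
            return jar.getJarEntry(entryName) != null;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: JarClassCheck <jar> <fully.qualified.ClassName>");
            return;
        }
        boolean present = containsClass(Paths.get(args[0]), args[1]);
        System.out.println(args[1] + (present ? " is present" : " is MISSING"));
    }
}
```

Run it against the shaded jar with `org.apache.parquet.hadoop.codec.SnappyCodec` as the second argument. If it reports the class as missing, the packaging step is dropping it.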

The solution was that I needed to add the following filters to the configuration of my maven shade plugin:
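The filter block itself is missing from this copy of the answer. Going by the description that follows (one `<include>` filter for parquet-hadoop, then two more for parquet-column and parquet-encoding), it probably looked roughly like the sketch below. The `**` include pattern is an assumption, not text recovered from the original; the artifacts are given under their usual `org.apache.parquet` groupId.

```xml
<!-- Sketch of the <filters> section inside the maven-shade-plugin <configuration>. -->
<!-- Assumption: "**" (keep every file in the artifact) is a guess at the pattern the answer used. -->
<filters>
    <filter>
        <artifact>org.apache.parquet:parquet-hadoop</artifact>
        <includes>
            <include>**</include>
        </includes>
    </filter>
    <filter>
        <artifact>org.apache.parquet:parquet-column</artifact>
        <includes>
            <include>**</include>
        </includes>
    </filter>
    <filter>
        <artifact>org.apache.parquet:parquet-encoding</artifact>
        <includes>
            <include>**</include>
        </includes>
    </filter>
</filters>
```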


Basically, maven-shade was dropping seemingly random classes from the parquet-hadoop artifact. Adding the <include> filter for that artifact makes maven-shade keep every class inside it, so SnappyCodec.class is no longer dropped.

After doing this, I needed to add the other two filters, because using the <include> tag on the parquet-hadoop artifact caused every other parquet-* artifact to be left out of the packaged jar. So I had to explicitly include parquet-column and parquet-encoding as well, since my application uses classes from those artifacts.

With this configuration, maven-shade does not touch these three artifacts: every class that was present in them before packaging is still there afterwards, and is therefore available at runtime (previously it was not, which caused the original error). Awesome!

