
lz4 exception when reading data from Kafka using Spark Streaming

I am trying to read JSON data from Kafka using the Spark Streaming API. When I do, it throws a java.lang.NoSuchMethodError: net.jpountz.lz4.LZ4BlockInputStream.&lt;init&gt; exception. The stack trace is:

java.lang.NoSuchMethodError: net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:122)
at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:163)
at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:124)
at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:50)
at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:50)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:421)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.streaming.StateStoreRestoreExec$$anonfun$doExecute$1.apply(statefulOperators.scala:217)
at org.apache.spark.sql.execution.streaming.StateStoreRestoreExec$$anonfun$doExecute$1.apply(statefulOperators.scala:215)
at org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:67)
at org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:62)
at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:78)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:77)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

My pom.xml file has the following dependencies:

    <!-- https://mvnrepository.com/artifact/net.jpountz.lz4/lz4 -->
    <dependency>
        <groupId>net.jpountz.lz4</groupId>
        <artifactId>lz4</artifactId>
        <version>1.3.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>2.3.1</version>
        <exclusions>
            <exclusion>
                <artifactId>lz4-java</artifactId>
                <groupId>org.lz4</groupId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.3.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
        <version>2.3.1</version>
        <scope>provided</scope>
    </dependency>

    <!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka -->
    <dependency>
        <groupId>org.apache.kafka</groupId>
        <artifactId>kafka_2.11</artifactId>
        <version>1.1.0</version>
    </dependency>

And here is the Spark streaming class, showing how I read the Kafka value as a string and then parse it into a Person class using a custom parser:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class JavaStructuredKafkaWordCount
{
    public static void main( String[] args ) throws Exception
    {
        if( args.length < 3 )
        {
            System.err
                    .println("Usage: JavaStructuredKafkaWordCount <bootstrap-servers> " + "<subscribe-type> <topics>");
            System.exit(1);
        }

        String bootstrapServers = args[0];
        String subscribeType = args[1];
        String topics = args[2];

        SparkSession spark = SparkSession.builder().appName("JavaStructuredKafkaWordCount")
                .config("spark.master", "local").getOrCreate();

        // Create a Dataset representing the stream of input values from Kafka
        Dataset<String> df = spark.readStream().format("kafka").option("kafka.bootstrap.servers", bootstrapServers)
                .option(subscribeType, topics).load().selectExpr("CAST(value AS STRING)").as(Encoders.STRING());

        // Parse each JSON value into a Person using the custom parser
        Dataset<Person> stringMein = df.map(
                (MapFunction<String, Person>) row -> JSONToPerson.parseJsonToPerson(row),
                Encoders.bean(Person.class));

        //stringMein.printSchema();
        // Generate a running count grouped by age
        Dataset<Row> cardDF = stringMein.groupBy("age").count();
        // Start the query that prints the running counts to the console
        StreamingQuery query = cardDF.writeStream().outputMode("update").format("console").start();

        query.awaitTermination();
    }
}
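
For reference, the Person and JSONToPerson classes are not shown above. A minimal sketch of what they might look like, assuming the JSON payload carries name and age fields and that Gson is used for parsing (both the fields and the library choice are assumptions; the actual classes may differ):

    import com.google.gson.Gson;

    // Hypothetical bean matching the JSON payload; the real Person class may have more fields.
    public class Person implements java.io.Serializable {
        private String name;
        private int age;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public int getAge() { return age; }
        public void setAge(int age) { this.age = age; }
    }

    // Hypothetical parser; the question's custom parser may use a different JSON library.
    class JSONToPerson {
        private static final Gson GSON = new Gson();

        static Person parseJsonToPerson(String json) {
            return GSON.fromJson(json, Person.class);
        }
    }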

A better option is to add this line to your Spark conf while initializing the SparkSession:

.config("spark.io.compression.codec", "snappy")

Another option is to add an exclusion rule for net.jpountz.lz4 in build.sbt:

lazy val excludeJars = ExclusionRule(organization = "net.jpountz.lz4", name = "lz4")
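
If you go this route, the exclusion rule still has to be applied to the dependencies that pull in the old net.jpountz.lz4 artifact (for example with excludeAll on the Kafka dependency), so that only the lz4 implementation Spark itself depends on remains on the classpath.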

Adding the following dependency worked for me:

    <dependency>
        <groupId>net.jpountz.lz4</groupId>
        <artifactId>lz4</artifactId>
        <version>1.3.0</version>
    </dependency>

In my case, the class CompressionCodecName is present in two transitive dependencies, with Maven coordinates 1) org.apache.hive:hive-exec:jar:2.1.1-cdh6.2.1:compile and 2) org.apache.parquet:parquet-common:jar:1.10.0:compile.

The error is due to hive-exec taking classpath precedence, and it does not have Lz4Codec. I was able to resolve the issue by placing org.apache.spark:spark-sql_2.11:2.4.0 before org.apache.spark:spark-hive_2.11:2.4.0-cdh6.2.1, as shown below:

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.4.0</version>
    </dependency>

    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_2.11</artifactId>
        <version>2.4.0-cdh6.2.1</version>
        <exclusions>
            <exclusion>
                <groupId>org.apache.thrift</groupId>
                <artifactId>libthrift</artifactId>
            </exclusion>
            <exclusion>
                <artifactId>commons-codec</artifactId>
                <groupId>commons-codec</groupId>
            </exclusion>
            <exclusion>
                <groupId>commons-cli</groupId>
                <artifactId>commons-cli</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
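
In general, Maven places direct dependencies on the classpath in the order they are declared, so when two artifacts provide the same class the one listed first wins; running mvn dependency:tree is a quick way to see which artifacts and versions actually end up on the classpath.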
