lz4 exception when reading data from kafka using spark streaming
I am trying to read JSON data from Kafka using the Spark Streaming API. When I do, it throws a java.lang.NoSuchMethodError: net.jpountz.lz4.LZ4BlockInputStream.<init> exception. The stack trace is:
java.lang.NoSuchMethodError: net.jpountz.lz4.LZ4BlockInputStream.<init>(Ljava/io/InputStream;Z)V
at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:122)
at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:163)
at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:124)
at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:50)
at org.apache.spark.shuffle.BlockStoreShuffleReader$$anonfun$3.apply(BlockStoreShuffleReader.scala:50)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:421)
at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:61)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:30)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage3.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
at org.apache.spark.sql.execution.streaming.StateStoreRestoreExec$$anonfun$doExecute$1.apply(statefulOperators.scala:217)
at org.apache.spark.sql.execution.streaming.StateStoreRestoreExec$$anonfun$doExecute$1.apply(statefulOperators.scala:215)
at org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:67)
at org.apache.spark.sql.execution.streaming.state.package$StateStoreOps$$anonfun$1.apply(package.scala:62)
at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:78)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:77)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
My pom.xml file has the following dependencies:
<!-- https://mvnrepository.com/artifact/net.jpountz.lz4/lz4 -->
<dependency>
<groupId>net.jpountz.lz4</groupId>
<artifactId>lz4</artifactId>
<version>1.3.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.11</artifactId>
<version>2.3.1</version>
<exclusions>
<exclusion>
<artifactId>lz4-java</artifactId>
<groupId>org.lz4</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.3.1</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.11</artifactId>
<version>2.3.1</version>
<scope>provided</scope>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.kafka/kafka -->
<dependency>
<groupId>org.apache.kafka</groupId>
<artifactId>kafka_2.11</artifactId>
<version>1.1.0</version>
</dependency>
And here is the Spark streaming class, showing how I read the Kafka value as a string and then parse it into a Person class using a custom parser:
public static void main(String[] args) throws Exception
{
    if (args.length < 3)
    {
        System.err
            .println("Usage: JavaStructuredKafkaWordCount <bootstrap-servers> " + "<subscribe-type> <topics>");
        System.exit(1);
    }

    String bootstrapServers = args[0];
    String subscribeType = args[1];
    String topics = args[2];

    SparkSession spark = SparkSession.builder().appName("JavaStructuredKafkaWordCount")
        .config("spark.master", "local").getOrCreate();

    // Create a Dataset representing the stream of input lines from Kafka
    Dataset<String> df = spark.readStream().format("kafka").option("kafka.bootstrap.servers", bootstrapServers)
        .option(subscribeType, topics).load().selectExpr("CAST(value AS STRING)").as(Encoders.STRING());

    // Parse each JSON string into a Person bean using the custom parser
    Dataset<Person> stringMein = df.map(
        (MapFunction<String, Person>) row -> JSONToPerson.parseJsonToPerson(row),
        Encoders.bean(Person.class));
    //stringMein.printSchema();

    // Generate a running count grouped by age
    Dataset<Row> cardDF = stringMein.groupBy("age").count();

    // Start running the query that prints the running counts to the console
    StreamingQuery query = cardDF.writeStream().outputMode("update").format("console").start();
    query.awaitTermination();
}
}
A better option is to add this line to your Spark conf while initializing the SparkSession:
.config("spark.io.compression.codec", "snappy")
Another option is to add an exclusion rule for net.jpountz.lz4 in build.sbt; note the rule only takes effect once it is applied to the relevant Spark/Kafka entries in libraryDependencies (for example via excludeAll):
lazy val excludeJars = ExclusionRule(organization = "net.jpountz.lz4", name = "lz4")
Adding the following dependency works for me:
<dependency>
<groupId>net.jpountz.lz4</groupId>
<artifactId>lz4</artifactId>
<version>1.3.0</version>
</dependency>
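If the error persists after changing dependencies, it helps to confirm which jar actually provides the class named in the stack trace at runtime. A small diagnostic sketch (the class name comes from the stack trace above; the surrounding class and method are just illustrative):

import net.jpountz.lz4.LZ4BlockInputStream;

public class Lz4ClasspathCheck
{
    public static void main(String[] args)
    {
        // Prints the jar (or directory) that LZ4BlockInputStream was loaded from,
        // which shows whether the old net.jpountz.lz4:lz4 jar or the newer
        // org.lz4:lz4-java jar wins on the classpath.
        System.out.println(LZ4BlockInputStream.class
            .getProtectionDomain()
            .getCodeSource()
            .getLocation());
    }
}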
In my case, the class CompressionCodecName is present in two transitive dependencies, with Maven coordinates 1) org.apache.hive:hive-exec:jar:2.1.1-cdh6.2.1:compile and 2) org.apache.parquet:parquet-common:jar:1.10.0:compile.
The error is due to classpath precedence: hive-exec comes first, and its copy does not have the Lz4Codec. I was able to resolve the issue by placing org.apache.spark:spark-sql_2.11:2.4.0 before org.apache.spark:spark-hive_2.11:2.4.0-cdh6.2.1, as shown below:
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.11</artifactId>
<version>2.4.0</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-hive_2.11</artifactId>
<version>2.4.0-cdh6.2.1</version>
<exclusions>
<exclusion>
<groupId>org.apache.thrift</groupId>
<artifactId>libthrift</artifactId>
</exclusion>
<exclusion>
<artifactId>commons-codec</artifactId>
<groupId>commons-codec</groupId>
</exclusion>
<exclusion>
<groupId>commons-cli</groupId>
<artifactId>commons-cli</artifactId>
</exclusion>
</exclusions>
</dependency>
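The same kind of classpath check as above can confirm the reordering worked; a sketch assuming the class in question is Parquet's org.apache.parquet.hadoop.metadata.CompressionCodecName (the fully qualified name is an assumption, the answer only gives the simple name):

import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetCodecClasspathCheck
{
    public static void main(String[] args)
    {
        // After the reordering, this should point at the parquet-common jar
        // rather than at the copy bundled inside hive-exec.
        System.out.println(CompressionCodecName.class
            .getProtectionDomain()
            .getCodeSource()
            .getLocation());
    }
}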