Do we need LzoCodec for the groupBy function in Scala Spark?

I have written a function in Scala Spark which looks like this:

def prepareSequences(data: RDD[String], splitChar: Char = '\t') = {
  val x = data.map { line =>
    val Array(id, se, offset, hour) = line.split(splitChar)
    (id + "-" + se,
      Step(
        offset = if (offset == "NULL") -5 else offset.toInt,
        hour = hour.toInt))
  }

  val y = x.groupBy(_._1)
}
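
For context, the snippet assumes an RDD import and a Step type that are not shown in the question; a minimal sketch of what they presumably look like:

    import org.apache.spark.rdd.RDD

    // Hypothetical definition of Step, inferred from the named arguments used above.
    case class Step(offset: Int, hour: Int)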

I need the groupBy, but as soon as I add it I get an error. The error is asking for LzoCodec.

    Exception in thread "main" java.lang.RuntimeException: Error in configuring object
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:112)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:78)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:136)
    at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:188)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:201)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
    at org.apache.spark.Partitioner$$anonfun$defaultPartitioner$2.apply(Partitioner.scala:66)
    at org.apache.spark.Partitioner$$anonfun$defaultPartitioner$2.apply(Partitioner.scala:66)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.immutable.List.map(List.scala:285)
    at org.apache.spark.Partitioner$.defaultPartitioner(Partitioner.scala:66)
    at org.apache.spark.rdd.RDD$$anonfun$groupBy$1.apply(RDD.scala:687)
    at org.apache.spark.rdd.RDD$$anonfun$groupBy$1.apply(RDD.scala:687)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.groupBy(RDD.scala:686)
    at com.savagebeast.mypackage.DataPreprocessing$.prepareSequences(DataPreprocessing.scala:42)
    at com.savagebeast.mypackage.activity_mapper$.main(activity_mapper.scala:31)
    at com.savagebeast.mypackage.activity_mapper.main(activity_mapper.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:109)
    ... 44 more
    Caused by: java.lang.IllegalArgumentException: Compression codec com.hadoop.compression.lzo.LzoCodec not found.
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:139)
    at org.apache.hadoop.io.compress.CompressionCodecFactory.<init>(CompressionCodecFactory.java:180)
    at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45)
    ... 49 more
    Caused by: java.lang.ClassNotFoundException: Class com.hadoop.compression.lzo.LzoCodec not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2101)
    at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:132)
    ... 51 more

I installed lzo and the other required pieces following "Class com.hadoop.compression.lzo.LzoCodec not found for Spark on CDH 5?".

Am I missing something?

UPDATE: Found a solution.

Partitioning the RDD like this resolved the problem:

val y = x.groupByKey(50)

50 is the number of partitions I want for the RDD; it can be any number.

However, I am not sure why this worked. I would appreciate it if someone could explain.
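
For reference, groupBy itself also accepts an explicit partition count, so something like the following keeps the original grouping while still supplying the number of partitions up front (a sketch only, not verified against this dataset):

    // groupBy(f, numPartitions) groups with a HashPartitioner of the given size.
    // Unlike groupByKey, each group here contains the full (id-se, Step) tuples
    // rather than just the Step values.
    val y = x.groupBy(_._1, 50)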

UPDATE-2: The following worked more sensibly and has been stable so far.

I copied hadoop-lzo-0.4.21-SNAPSHOT.jar from /Users/<username>/hadoop-lzo/target to /usr/local/Cellar/apache-spark/2.1.0/libexec/jars, essentially copying the jar to Spark's classpath.
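
After copying the jar, a quick sanity check from spark-shell (or any driver-side code) that the codec class is now resolvable, as a sketch:

    // Throws java.lang.ClassNotFoundException (the same error as in the stack trace)
    // if the jar is still not on the driver classpath.
    Class.forName("com.hadoop.compression.lzo.LzoCodec")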

No, it is not required by groupBy. If you take a look at the stack trace (kudos for posting it) you'll see it fails somewhere in the input format:

 at org.apache.hadoop.mapred.TextInputFormat.configure(TextInputFormat.java:45) 

This suggests that your input is compressed. It fails when you call groupBy because this is the point where Spark has to decide on the number of partitions and therefore touch the input.

In practice, yes, it seems you need the LZO codec to execute your job.
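
To see that it is the input rather than the shuffle, forcing partition discovery alone should reproduce the error, assuming data is the textFile-backed RDD passed into prepareSequences (a sketch):

    // Computing the partitions forces TextInputFormat.configure, which builds the
    // CompressionCodecFactory and fails if LzoCodec is not on the classpath.
    val numSplits = data.partitions.length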
