
Read AVRO structures saved in HBase columns

I am new to Spark and HBase. I am working with backups of an HBase table that are stored in an S3 bucket. I am reading them via Spark (Scala) using newAPIHadoopFile like this:

conf.set("io.serializations", "org.apache.hadoop.io.serializer.WritableSerialization,org.apache.hadoop.hbase.mapreduce.ResultSerialization")
val data = sc.newAPIHadoopFile(path,classOf[SequenceFileInputFormat[ImmutableBytesWritable, Result]], classOf[ImmutableBytesWritable], classOf[Result], conf)

The table in question is called Emps. The schema of Emps is:

key: empid {COMPRESSION => 'gz' }
  family: data
    dob - date of birth of this employee.
    e_info - avro structure for storing emp info.
    e_dept - avro structure for storing info about dept.

  family: extra - Extra Metadata {NAME => 'extra', BLOOMFILTER => 'ROW', VERSIONS => '1', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'FALSE', DATA_BLOCK_ENCODING => 'NONE', TTL => 'FOREVER', COMPRESSION => 'SNAPPY', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
    e_region - emp region
    e_status - some data about his achievements
    .
    .
    some more meta data

The table has some columns that contain simple string data and some columns that contain AVRO structures.

I am trying to read this data directly from the HBase backup files in S3. I do not want to re-create this HBase table on my local machine because the table is very, very large.

This is how I am trying to read it:

data.keys.map{k=>(new String(k.get()))}.take(1)
res1: Array[String] = Array(111111111100011010102462)

data.values.map { v =>
  for (cell <- v.rawCells()) yield {
    val family = CellUtil.cloneFamily(cell)
    val column = CellUtil.cloneQualifier(cell)
    val value  = CellUtil.cloneValue(cell)
    new String(family) + "->" + new String(column) + "->" + new String(value)
  }
}.take(1)
res2: Array[Array[String]] = Array(Array(info->dob->01/01/1996,  info->e_info->?ж�?�ո� ?�� ???̶�?�ո� ?�� ????, info->e_dept->?ж�??�ո� ?̶�??�ո� �ո� ??, extra->e_region-> CA, extra->e_status->, .....))

As expected, I can see the simple string data correctly, but the AVRO data is garbage.

I tried reading the AVRO structures using GenericDatumReader:

data.values.map { v =>
  for (cell <- v.rawCells()) yield {
    val family = new String(CellUtil.cloneFamily(cell))
    val column = new String(CellUtil.cloneQualifier(cell))
    val value  = CellUtil.cloneValue(cell)
    if (column == "e_info") {
      var schema_obj = new Schema.Parser
      // schema_e_info contains the AVRO schema for e_info
      var schema = schema_obj.parse(schema_e_info)
      var READER2 = new GenericDatumReader[GenericRecord](schema)
      var datum = READER2.read(null, DecoderFactory.defaultFactory.createBinaryDecoder(value, null))
      var result = datum.get("type").toString()
      family + "->" + column + "->" + result + "\n"
    } else {
      family + "->" + column + "->" + new String(value) + "\n"
    }
  }
}

But this gives me the following error:

org.apache.spark.SparkException: Task not serializable
  at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
  at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
  at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
  at org.apache.spark.SparkContext.clean(SparkContext.scala:2101)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
  at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
  at org.apache.spark.rdd.RDD.map(RDD.scala:369)
  ... 74 elided
Caused by: java.io.NotSerializableException: org.apache.avro.Schema$RecordSchema
Serialization stack:
    - object not serializable (class: org.apache.avro.Schema$RecordSchema, value: .....

So I want to ask:

  1. Is there any way to make the non-serializable class RecordSchema work with the map function?
  2. Is my approach right up to this point? I would be glad to know about better approaches to handle this kind of data.
  3. I have read that handling this in a DataFrame would be a lot easier. I tried to convert the resulting Hadoop RDD into a DataFrame, but again I am flying blind there.

As the exception says, the schema is not serializable. Can you initialize it inside the mapper function, so that it does not need to be shipped from the driver to the executors?
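
For example, a minimal sketch of that idea, assuming schema_e_info holds the AVRO schema JSON as a plain String (only the String is shipped to the executors; the Schema itself is built inside the closure):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.hadoop.hbase.CellUtil

val schemaJson = schema_e_info   // plain String, serializable

val decoded = data.values.map { result =>
  result.rawCells().map { cell =>
    val family = new String(CellUtil.cloneFamily(cell))
    val column = new String(CellUtil.cloneQualifier(cell))
    val value  = CellUtil.cloneValue(cell)
    if (column == "e_info") {
      // Parsed here, inside the closure, so it happens on the executor.
      val schema = new Schema.Parser().parse(schemaJson)
      val reader = new GenericDatumReader[GenericRecord](schema)
      val record = reader.read(null, DecoderFactory.get().binaryDecoder(value, null))
      s"$family->$column->${record.get("type")}"
    } else {
      s"$family->$column->${new String(value)}"
    }
  }
}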

Alternatively, you can create a Scala singleton object that contains the schema. One instance of the singleton is initialized on each executor, so when you access any of its members nothing needs to be serialized and sent across the network. This also avoids the unnecessary overhead of re-creating the schema for every row in the data.
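
A sketch of the singleton approach (EInfoAvro and its schema constant are hypothetical names; the object has to be able to obtain the schema JSON on its own, e.g. from a string constant or a file on the classpath, because it is constructed independently on each executor):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory

object EInfoAvro {
  // For illustration only -- in practice this would be the full e_info schema.
  private val eInfoSchemaJson =
    """{"type":"record","name":"e_info","fields":[{"name":"type","type":"string"}]}"""

  // Parsed lazily, once per JVM, i.e. once per executor.
  lazy val schema: Schema = new Schema.Parser().parse(eInfoSchemaJson)

  def decode(bytes: Array[Byte]): GenericRecord = {
    // Creating a reader per call is cheap compared to re-parsing the schema.
    val reader = new GenericDatumReader[GenericRecord](schema)
    reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null))
  }
}

Inside the map you would then call EInfoAvro.decode(value) for the e_info cells; the closure only refers to the object by name, so nothing non-serializable is captured. This works best when the object lives in a compiled jar rather than being pasted into the spark-shell, where REPL wrapping can drag in outer references.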

Just for the purpose of checking that you can read the data fine, you can also convert it to a byte array on the executors, collect it on the driver, and do the deserialization (parsing the AVRO data) in the driver code. This obviously won't scale; it is just to make sure your data looks good and to avoid Spark-related complications while you are writing prototype code to extract the data.
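
A quick sanity check along those lines might look like this (again assuming schema_e_info is the schema JSON as a String; only a handful of values are pulled back to the driver):

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.hadoop.hbase.CellUtil

// Ship only raw byte arrays back to the driver -- they are serializable.
val rawEInfo = data.values.flatMap { result =>
  result.rawCells()
    .filter(c => new String(CellUtil.cloneQualifier(c)) == "e_info")
    .map(c => CellUtil.cloneValue(c))
}.take(10)

// Deserialize on the driver, where the Schema never has to be serialized.
val schema = new Schema.Parser().parse(schema_e_info)
val reader = new GenericDatumReader[GenericRecord](schema)
rawEInfo.foreach { bytes =>
  val record = reader.read(null, DecoderFactory.get().binaryDecoder(bytes, null))
  println(record)
}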
