
Scala/Spark serialization error - streaming data to HBASE

I am a newbie to Scala/Spark. In the following code, I am extracting Twitter public stream content into HBase.
When I comment out the last four lines (the Put commands for HBase), I can print the content of each tweet on the terminal, but I am unable to dump it into the HBase table.

I need help with the following:
1. How can I overcome the serialization error?
2. Are there efficient methods (perhaps using Kryo serialization) to overcome this error?

Caused by: java.io.NotSerializableException: org.apache.hadoop.conf.Configuration
Serialization stack:
- object not serializable (class: org.apache.hadoop.conf.Configuration, value: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml)

import twitter4j.auth._
import twitter4j.conf._
import twitter4j._
import twitter4j.json._
import scala.io.Source
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import java.io._
import org.apache.spark.streaming.twitter.TwitterUtils

////////////////////////////
// spark-shell already provides a SparkContext as `sc`; a SparkConf/SparkContext
// would only be needed when running this as a standalone application.
val sparkConf = new SparkConf().setAppName("model1").setMaster("local[*]")
// val sc = new SparkContext(sparkConf)

val TABLE_NAME = "publicrd"
val CF_USER = "user"
val CF_TWEET = "tweet"
val CF_ENTITIES = "entities"
val CF_PLACES = "places"

val hadoopConf = new Configuration
val hbaseConf = HBaseConfiguration.create(hadoopConf)
val admin = new HBaseAdmin(hbaseConf)
val tableDesc = new HTableDescriptor(TableName.valueOf(TABLE_NAME))

// Define column family descriptor
val ColumnFamilyDesc1 = new HColumnDescriptor(Bytes.toBytes(CF_USER))
val ColumnFamilyDesc2 = new HColumnDescriptor(Bytes.toBytes(CF_TWEET))
val ColumnFamilyDesc3 = new HColumnDescriptor(Bytes.toBytes(CF_ENTITIES))
val ColumnFamilyDesc4 = new HColumnDescriptor(Bytes.toBytes(CF_PLACES))

// Add column family in table descriptor
tableDesc.addFamily(ColumnFamilyDesc1)
tableDesc.addFamily(ColumnFamilyDesc2)
tableDesc.addFamily(ColumnFamilyDesc3)
tableDesc.addFamily(ColumnFamilyDesc4)

// Check if the table exists
if (admin.tableExists(TABLE_NAME)) {
  print(">>>>> " + TABLE_NAME + " already exists <<<<<")
  admin.disableTable(TABLE_NAME)
  admin.deleteTable(TABLE_NAME)
}

// Create HBASE table
admin.createTable(tableDesc)
val table = new HTable(hbaseConf, TABLE_NAME)
/////

val timewindow = 2 // seconds

val ssc = new StreamingContext(sc, Seconds(timewindow))
val cb = new ConfigurationBuilder

val ckey = "ckey"
val csecret = "csecret"
val atoken = "atoken"
val atokensecret = "atokensecret"

cb.setDebugEnabled(true).
setOAuthConsumerKey(ckey).
setOAuthConsumerSecret(csecret).
setOAuthAccessToken(atoken).
setOAuthAccessTokenSecret(atokensecret).
setJSONStoreEnabled(true)

val auth = new OAuthAuthorization(cb.build)
val tweets = TwitterUtils.createStream(ssc,Some(auth)) 

val status = tweets.filter(_.getLang()=="en")

status.foreachRDD(rdd => {
  rdd.foreachPartition { records =>
    while (records.hasNext) {
      val record = records.next
      print("\n\n>>>>" + record)

      val tweetID = record.getUser().getId().toString
      print("\ntweetID : " + tweetID)

      val tweetBody = record.getText()
      print("\ntweetBody : " + tweetBody)

      val favoritesCount = record.getFavoriteCount()
      print("\nfavoritesCount : " + favoritesCount)

      val keyrow = "t_" + tweetID
      print("\nkeyrow : " + keyrow + "\n")

      val theput = new Put(Bytes.toBytes(keyrow))
      theput.addColumn(Bytes.toBytes(CF_TWEET), Bytes.toBytes("tweetid"), Bytes.toBytes(tweetID))
      theput.addColumn(Bytes.toBytes(CF_TWEET), Bytes.toBytes("tweetbody"), Bytes.toBytes(tweetBody))
      theput.addColumn(Bytes.toBytes(CF_USER), Bytes.toBytes("favoritescount"), Bytes.toBytes(favoritesCount))
      // `table` (and the Configuration inside it) is created on the driver and
      // captured by this closure, so Spark tries to serialize it and ship it
      table.put(theput)
    }
  }
})

The code is run on the terminal via:

spark-shell --driver-class-path /opt/hadoop/hbase-1.2.1/lib/hbase-server-1.1.4.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-protocol-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-hadoop2-compat-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-client-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-common-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/htrace-core-3.2.0-incubating.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/guava-19.0.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/spark-streaming-twitter_2.10-1.6.1.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-async-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-core-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-examples-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-media-support-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-stream-4.0.4.jar

It says the object org.apache.hadoop.conf.Configuration is not serializable, which means it does not implement the Serializable interface even though one is required here. To get rid of that, add the @transient annotation:

@transient val hadoopConf = new Configuration
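
On question 2: Kryo is unlikely to help with this particular error, because Spark serializes closures with Java serialization regardless of the spark.serializer setting. A more robust fix is to keep the Configuration and HTable out of the closure entirely by opening the connection inside foreachPartition, so each executor creates its own connection and nothing non-serializable is shipped from the driver. A minimal sketch of that pattern, assuming HBase 1.x and the same table and column-family names as above:

status.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // Everything below runs on the executor, so nothing here has to be
    // serialized and shipped from the driver.
    val hbaseConf = HBaseConfiguration.create()
    val connection = ConnectionFactory.createConnection(hbaseConf)
    val table = connection.getTable(TableName.valueOf("publicrd"))
    try {
      records.foreach { record =>
        val theput = new Put(Bytes.toBytes("t_" + record.getUser().getId().toString))
        theput.addColumn(Bytes.toBytes("tweet"), Bytes.toBytes("tweetbody"), Bytes.toBytes(record.getText()))
        table.put(theput)
      }
    } finally {
      table.close()
      connection.close()
    }
  }
}

Creating one connection per partition amortizes the setup cost across all records in that partition; if throughput matters, the puts could also be collected and written in batches rather than one table.put per record.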
