[英]Scala/Spark serialization error - streaming data to HBASE
I am a newbie to Scala/Spark. 我是Scala / Spark的新手。 In the following code, I am extracting Twitter public stream content to the HBase.
在以下代码中,我将Twitter公共流内容提取到HBase。
On commenting the last four lines (put commands in HBase), I am able to print content of tweet on the terminal, however unable to dump it to the HBase table. 在注释最后四行(在HBase中放入命令)时,我能够在终端上打印tweet的内容,但是无法将其转储到HBase表中。
I need help in on the following regards: 我需要以下方面的帮助:
1. How can I overcome the serialilzation error? 1.如何克服序列化错误?
2. Are there efficient methods (may be useing Kryo serialilzation) to overcome this error? 2.是否有有效的方法(可能使用Kryo序列化)来克服此错误?
Caused by: java.io.NotSerializableException:
引起原因:java.io.NotSerializableException:
org.apache.hadoop.conf.Configuration Serialization stack:org.apache.hadoop.conf.Configuration序列化堆栈:
- object not serializable (class: org.apache.hadoop.conf.Configuration, value: Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml)-不可序列化的对象(类:org.apache.hadoop.conf.Configuration,值:Configuration:core-default.xml,core-site.xml,mapred-default.xml,mapred-site.xml,yarn-default.xml ,yarn-site.xml,hdfs-default.xml,hdfs-site.xml)
import twitter4j.auth._
import twitter4j.conf._
import twitter4j._
import twitter4j.json._
import scala.io.Source
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.client.ConnectionFactory
import org.apache.hadoop.hbase.client.HBaseAdmin
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.HColumnDescriptor
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.KeyValue
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, HTableDescriptor}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark._
import org.apache.spark.rdd.NewHadoopRDD
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.hadoop.hbase.util.Bytes
import java.io._
import org.apache.spark.streaming.twitter.TwitterUtils
////////////////////////////
val conf = new SparkConf().setAppName("model1").setMaster("local[*]")
// val sc = new SparkContext(conf)
val TABLE_NAME = "publicrd"
val CF_USER = "user"
val CF_TWEET = "tweet"
val CF_ENTITIES = "entities"
val CF_PLACES = "places"
val hadoopConf = new Configuration
val conf = HBaseConfiguration.create(hadoopConf)
val admin = new HBaseAdmin(conf)
val tableDesc = new HTableDescriptor(Bytes.toBytes(TABLE_NAME))
// Define column family descriptor
val ColumnFamilyDesc1 = new HColumnDescriptor(Bytes.toBytes(CF_USER))
val ColumnFamilyDesc2 = new HColumnDescriptor(Bytes.toBytes(CF_TWEET))
val ColumnFamilyDesc3 = new HColumnDescriptor(Bytes.toBytes(CF_ENTITIES))
val ColumnFamilyDesc4 = new HColumnDescriptor(Bytes.toBytes(CF_PLACES))
// Add column family in table descriptor
tableDesc.addFamily(ColumnFamilyDesc1)
tableDesc.addFamily(ColumnFamilyDesc2)
tableDesc.addFamily(ColumnFamilyDesc3)
tableDesc.addFamily(ColumnFamilyDesc4)
// Check if the table exists
if (admin.tableExists(TABLE_NAME)){
print(">>>>>" + TABLE_NAME + " already exists <<<<<")
admin.disableTable(TABLE_NAME)
admin.deleteTable(TABLE_NAME)
}
// Create HBASE table
admin.createTable(tableDesc)
val table = new HTable(conf, TABLE_NAME)
/////
val timewindow = 2 // seconds
val ssc = new StreamingContext(sc, Seconds(timewindow))
val cb = new ConfigurationBuilder
val ckey = "ckey"
val csecret = "csecret"
val atoken = "atoken"
val atokensecret = "atokensecret"
cb.setDebugEnabled(true).
setOAuthConsumerKey(ckey).
setOAuthConsumerSecret(csecret).
setOAuthAccessToken(atoken).
setOAuthAccessTokenSecret(atokensecret).
setJSONStoreEnabled(true)
val auth = new OAuthAuthorization(cb.build)
val tweets = TwitterUtils.createStream(ssc,Some(auth))
val status = tweets.filter(_.getLang()=="en")
status.foreachRDD(foreachFunc = rdd => {
rdd.foreachPartition {
records => while (records.hasNext) {
var record = records.next
print("\n\n>>>>"+record)
var tweetID = record.getUser().getId().toString//.isInstanceOf[Int]
print("\ntweetID : "+tweetID)
var tweetBody = record.getText()//.toString
print("\ntweetBody : "+tweetBody)
var favoritesCount = record.getFavoriteCount()//.toInt
print("\nfavoritesCount : "+favoritesCount)
var keyrow = "t_"+tweetID //"t_"+
print("\nkeyrow : "+keyrow+"\n")
var theput= new Put(Bytes.toBytes(keyrow))
theput.add(Bytes.toBytes(CF_TWEET),Bytes.toBytes("tweetid"),Bytes.toBytes(tweetID))
theput.add(Bytes.toBytes(CF_TWEET),Bytes.toBytes("tweetid"),Bytes.toBytes(tweetBody))
theput.add(Bytes.toBytes(CF_USER),Bytes.toBytes("tweetid"),Bytes.toBytes(favoritesCount))
table.put(theput)
}
}
}
)
The code is run on the terminal via: 该代码通过以下方式在终端上运行:
spark-shell --driver-class-path /opt/hadoop/hbase-1.2.1/lib/hbase-server-1.1.4.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-protocol-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-hadoop2-compat-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-client-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/hbase-common-1.0.0-cdh5.5.0.jar:/opt/hadoop/hbase-1.2.1/lib/htrace-core-3.2.0-incubating.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/guava-19.0.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/spark-streaming-twitter_2.10-1.6.1.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-async-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-core-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-examples-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-media-support-4.0.4.jar:/home/cloudera/Desktop/hbase/twitter4jJARS/twitter4j-stream-4.0.4.jar
It says the object org.apache.hadoop.conf.Configuration
is not serialisable which mean it does not implement the Serializable
interface while it's required. 它说对象
org.apache.hadoop.conf.Configuration
不可序列化,这意味着它在需要时未实现Serializable
接口。 To get rid of that add @transient
keyword. 要摆脱这种情况,请添加
@transient
关键字。
@transient val hadoopConf = new Configuration
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.