
Loading a CSV file into HBase through Spark

This is a simple "how to" question: we can bring data into the Spark environment through com.databricks.spark.csv. I know how to create an HBase table through Spark, and how to write data to HBase tables manually. But is it even possible to load a text/CSV/JSON file directly into HBase through Spark? I cannot find anybody talking about it, so I'm just checking. If it is possible, please point me to a good website that explains the Scala code in detail.

Thank you,

There are multiple ways you can do that.

  1. Spark HBase connector (SHC):

https://github.com/hortonworks-spark/shc

You can find a lot of examples at that link.
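As a sketch of what loading a CSV through SHC can look like (the table name `payments`, the column layout, and the catalog below are my own assumptions for illustration, not taken from the SHC README):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

object ShcCsvToHBase {
  // SHC maps DataFrame columns to HBase cells through a JSON "catalog".
  // Table name, column family and column names here are made up.
  val catalog =
    s"""{
       |"table":{"namespace":"default", "name":"payments"},
       |"rowkey":"key",
       |"columns":{
       |  "PaymentNumber":{"cf":"rowkey", "col":"key", "type":"string"},
       |  "PaymentDate":{"cf":"cf", "col":"PaymentDate", "type":"string"},
       |  "VendorName":{"cf":"cf", "col":"VendorName", "type":"string"},
       |  "Amount":{"cf":"cf", "col":"Amount", "type":"string"}
       |}
       |}""".stripMargin

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("csv-to-hbase").getOrCreate()

    // On Spark 2.x, spark.read.csv replaces the external
    // com.databricks.spark.csv package mentioned in the question.
    val df = spark.read.option("header", "true").csv(args(0))

    df.write
      .options(Map(HBaseTableCatalog.tableCatalog -> catalog,
                   HBaseTableCatalog.newTable -> "5"))
      .format("org.apache.spark.sql.execution.datasources.hbase")
      .save()
  }
}
```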

  2. You can also use Spark core to load the data into HBase using HBaseConfiguration.

Code example (new MapReduce API, imports included):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes

  val fileRDD = sc.textFile(args(0), 2)
  val transformedRDD = fileRDD.map { line => convertToKeyValuePairs(line) }

  val conf = HBaseConfiguration.create()
  conf.set(TableOutputFormat.OUTPUT_TABLE, "tableName")
  conf.set("hbase.zookeeper.quorum", "localhost:2181")
  conf.set("hbase.master", "localhost:60000")
  conf.set("fs.default.name", "hdfs://localhost:8020")
  conf.set("hbase.rootdir", "/hbase")

  // The output key/value classes must match what convertToKeyValuePairs
  // returns: (ImmutableBytesWritable, Put).
  val jobConf = new Configuration(conf)
  jobConf.set("mapreduce.job.output.key.class", classOf[ImmutableBytesWritable].getName)
  jobConf.set("mapreduce.job.output.value.class", classOf[Put].getName)
  jobConf.set("mapreduce.job.outputformat.class", classOf[TableOutputFormat[ImmutableBytesWritable]].getName)

  transformedRDD.saveAsNewAPIHadoopDataset(jobConf)



def convertToKeyValuePairs(line: String): (ImmutableBytesWritable, Put) = {
    // Split once; String.split takes a regex, so the pipe delimiter
    // must be escaped.
    val fields = line.split("\\|")
    val cfDataBytes = Bytes.toBytes("cf")
    val rowkey = Bytes.toBytes(fields(1))
    val put = new Put(rowkey)

    put.addColumn(cfDataBytes, Bytes.toBytes("PaymentDate"), Bytes.toBytes(fields(0)))
    put.addColumn(cfDataBytes, Bytes.toBytes("PaymentNumber"), Bytes.toBytes(fields(1)))
    put.addColumn(cfDataBytes, Bytes.toBytes("VendorName"), Bytes.toBytes(fields(2)))
    put.addColumn(cfDataBytes, Bytes.toBytes("Category"), Bytes.toBytes(fields(3)))
    put.addColumn(cfDataBytes, Bytes.toBytes("Amount"), Bytes.toBytes(fields(4)))
    (new ImmutableBytesWritable(rowkey), put)
  }
  3. You can also use this connector:

https://github.com/nerdammer/spark-hbase-connector
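With that connector the whole load becomes a short method chain; a minimal sketch, assuming a comma-separated input file with the same five fields as the example above (the file name and table name `payments` are placeholders):

```scala
import it.nerdammer.spark.hbase._
import org.apache.spark.{SparkConf, SparkContext}

object NerdammerCsvToHBase {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("csv-to-hbase")
      .set("spark.hbase.host", "localhost") // host where ZooKeeper runs

    val sc = new SparkContext(sparkConf)

    // The first element of each tuple becomes the row key; the rest
    // map, in order, onto the columns listed in toColumns().
    sc.textFile(args(0))
      .map(_.split(","))
      .map(f => (f(1), f(0), f(2), f(3), f(4)))
      .toHBaseTable("payments")
      .toColumns("PaymentDate", "VendorName", "Category", "Amount")
      .inColumnFamily("cf")
      .save()
  }
}
```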
