
How to load data into hive external table using spark?

I want to load data into a Hive external table using Spark. Please help me with this: how do I load data into Hive using Scala or Java code?

Thanks in advance.

Assuming that the Hive external table has already been created with something like:

CREATE EXTERNAL TABLE external_parquet(c1 INT, c2 STRING, c3 TIMESTAMP) 
    STORED AS PARQUET LOCATION '/user/etl/destination';   -- location is some directory on HDFS
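
If you prefer to issue that DDL from Spark rather than the Hive shell, here is a minimal sketch, assuming Spark 1.x with Hive support (a HiveContext) available; the table name and location are the ones from the statement above:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)   // sc is your existing SparkContext
    hiveContext.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS external_parquet(c1 INT, c2 STRING, c3 TIMESTAMP)
        |STORED AS PARQUET LOCATION '/user/etl/destination'""".stripMargin)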

And you have an existing DataFrame / RDD in Spark that you want to write:

import java.sql.Timestamp                    // matches the TIMESTAMP column c3
import org.apache.spark.sql.SaveMode
import sqlContext.implicits._
val now = new Timestamp(System.currentTimeMillis)
val rdd = sc.parallelize(List((1, "a", now), (2, "b", now), (3, "c", now)))
val df = rdd.toDF("c1", "c2", "c3")  // column names for your data frame
df.write.mode(SaveMode.Overwrite).parquet("/user/etl/destination")  // overwrite the existing dataset (full reimport from some source)

If you want to append to the existing dataset instead of overwriting it:

df.write.mode(SaveMode.Append).parquet("/user/etl/destination")  // If you want to append to existing dataset (incremental imports)
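
To sanity-check either load, you can query the external table from Spark itself. A minimal sketch, assuming Hive support is enabled (a HiveContext) and using the table name from above:

    import org.apache.spark.sql.hive.HiveContext

    val hiveContext = new HiveContext(sc)
    hiveContext.sql("SELECT c1, c2, c3 FROM external_parquet LIMIT 10").show()
    hiveContext.sql("SELECT COUNT(*) AS cnt FROM external_parquet").show()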

I have tried a similar scenario and had satisfactory results. I worked with Avro data whose schema was in JSON. I streamed a Kafka topic with Spark Streaming and persisted the data into HDFS, which is the location of an external table. So every 2 seconds (the streaming batch duration), the data is stored into HDFS in a separate file and the Hive external table is appended as well.

Here is a simple code snippet:

    import kafka.serializer.StringDecoder
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.kafka.KafkaUtils

    val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)

    messages.foreachRDD { rdd =>
      // Reuse a single SQLContext per SparkContext instead of creating one per batch
      val sqlContext = org.apache.spark.sql.SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._

      // Each Kafka record is a (key, value) pair; the value holds the JSON payload
      val myEvent = sqlContext.read.json(rdd.map(_._2))

      // Append the batch to the directory backing the Hive external table
      myEvent.write
        .format("parquet")
        .mode(org.apache.spark.sql.SaveMode.Append)
        .save("maprfs:///location/of/hive/external/table")
    }
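
For completeness, the snippet assumes that ssc, kafkaConf and topicMaps already exist. A rough sketch of that setup, where the ZooKeeper address, group id and topic name are placeholders rather than values from the original answer:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val sc  = new SparkContext(new SparkConf().setAppName("kafka-to-hive-external-table"))
    val ssc = new StreamingContext(sc, Seconds(2))   // 2-second batch duration, as described above

    // Receiver-based Kafka 0.8 consumer configuration (placeholder values)
    val kafkaConf = Map(
      "zookeeper.connect" -> "zkhost:2181",
      "group.id"          -> "spark-streaming-consumer")

    // topic name -> number of receiver threads
    val topicMaps = Map("my_topic" -> 1)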

Don't forget to stop the StreamingContext (ssc) at the end of the application; stopping it gracefully is preferable.
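
A minimal sketch of starting the context and then shutting it down gracefully; runDurationMs is a hypothetical application-defined limit, not something from the original answer:

    ssc.start()
    ssc.awaitTerminationOrTimeout(runDurationMs)   // runDurationMs: hypothetical run-time limit in milliseconds

    // Let in-flight batches finish before exiting, and stop the SparkContext too
    ssc.stop(stopSparkContext = true, stopGracefully = true)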

PS: Note that when creating the external table, make sure its schema is identical to the DataFrame schema, because when the JSON is converted into a DataFrame (which is essentially a table), the columns are arranged in alphabetical order.
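
If the alphabetical order does not match your table definition, one way to enforce the order (a sketch using myEvent from the snippet above and the hypothetical column names c1, c2, c3) is to select the columns explicitly before writing:

    // Force the column order to match the external table definition (c1, c2, c3)
    myEvent.select("c1", "c2", "c3")
      .write
      .format("parquet")
      .mode(org.apache.spark.sql.SaveMode.Append)
      .save("maprfs:///location/of/hive/external/table")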
