How to create an RDD directly from a Hive table?

I am learning Spark and creating RDDs with the SparkContext object from local files, S3, and HDFS, as follows:

val lines = sc.textFile("file://../kv/mydata.log")

val lines = sc.textFile("s3n://../kv/mydata.log")

val lines = sc.textFile("hdfs://../kv/mydata.log")

Now I have some data in Hive tables. Is it possible to load Hive tables directly and use that data as an RDD?

It can be done using the HiveContext as follows:

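import org.apache.spark.sql.hive.HiveContext
val hiveContext = new HiveContext(sc)
// sql() returns a DataFrame (Spark 1.3+); call .rdd on it to get an RDD[Row]
val rows = hiveContext.sql("SELECT name, age FROM students")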
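The query above yields a DataFrame rather than an RDD. A minimal sketch of the remaining step, assuming `sc` is an existing SparkContext (as in spark-shell) and that the `students` table has a string `name` column and an integer `age` column (the column types are an assumption, not stated in the answer):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.hive.HiveContext

// Assumption: `sc` already exists and Hive table students(name STRING, age INT) is available.
// HiveContext is deprecated since Spark 2.0 in favour of SparkSession.
val hiveContext = new HiveContext(sc)
val rows = hiveContext.sql("SELECT name, age FROM students")

// DataFrame.rdd exposes the query result as an RDD[Row]
val rowRdd: RDD[Row] = rows.rdd

// Convert each Row into a plain (name, age) tuple; getInt fails on NULL values,
// so this assumes the age column contains no NULLs
val students: RDD[(String, Int)] = rowRdd.map(r => (r.getString(0), r.getInt(1)))

// From here on, ordinary RDD operations apply
students.filter { case (_, age) => age >= 18 }.take(5).foreach(println)
```

The tuple conversion is optional; `rowRdd` is already an RDD and can be used directly if working with `Row` objects is acceptable.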

For structured data such as Hive tables, the DataFrame API is now generally preferred over working with RDDs directly. You can read the data straight from Hive tables into DataFrames using the newer Spark APIs. Here's the link for Spark version 2.3.0 (change the version to match your installation):

https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#hive-tables

Here's a sample program. You can store the result of the last line in a DataFrame and apply the kinds of operations you would normally run on an RDD, such as map and filter.

// Accessing Hive tables from Spark
import java.io.File
import org.apache.spark.sql.{Row, SaveMode, SparkSession}

case class People(name: String, age: Int, city: String, state: String, height: Double, weight: Double)

// Directory Spark uses for managed databases and tables
val warehouseLocation = new File("spark-warehouse").getAbsolutePath
val spark = SparkSession.builder
  .master("yarn")
  .appName("My Hive App")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()

import spark.implicits._
import spark.sql

// Create the Hive table if needed, load a local comma-delimited file into it, then query it
sql("CREATE TABLE IF NOT EXISTS people(name String,age Int,city String,state String,height Double,weight Double)  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
sql("LOAD DATA LOCAL INPATH 'file:/home/amalprakash32203955/data/people1.txt' INTO TABLE people")
sql("SELECT * FROM people").show()
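If an RDD is still needed, as the question asks, the DataFrame can be converted back. The following is a sketch under assumptions: it is run in spark-shell or as a script, the `people` table created above exists, and its columns contain no NULLs. The names `peopleDF`, `rowRdd`, `peopleRdd` and `adultNames` are illustrative and not part of the original answer.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}

// Same shape as the Hive table; in a compiled application this must be a top-level class
case class People(name: String, age: Int, city: String, state: String, height: Double, weight: Double)

// Reuses an existing session if one is already running (e.g. in spark-shell)
val spark = SparkSession.builder
  .appName("Hive table to RDD")
  .enableHiveSupport()
  .getOrCreate()
import spark.implicits._

// Read the Hive table as a DataFrame
val peopleDF = spark.table("people")

// Untyped route: an RDD of generic Row objects
val rowRdd: RDD[Row] = peopleDF.rdd

// Typed route: map columns onto the case class by name, then drop down to an RDD
val peopleRdd: RDD[People] = peopleDF.as[People].rdd

// Ordinary RDD transformations work from here
val adultNames: RDD[String] = peopleRdd.filter(_.age >= 18).map(_.name)
adultNames.take(10).foreach(println)
```

Staying with the DataFrame or Dataset lets Spark optimize the query, so dropping to `.rdd` is probably only worth it when an API specifically requires an RDD.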

