Hive external table on Avro files producing only NULL data for all columns
I am trying to create a Hive external table on top of some Avro files which were generated using spark-scala. I am using CDH 5.16, which has Hive 1.1 and Spark 1.6.
I created the Hive external table, and the statement ran successfully. But when I query the data, I get NULL for all the columns. My problem is similar to this one. Upon some research, I found out it might be a problem with the schema, but I couldn't find a schema file for these Avro files in that location.
I am pretty new to the Avro file type, so can someone please help me out here? Below is the Spark code snippet where I saved the file as Avro:
df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
Below is my Hive external table create statement:
create external table prod_order_avro
(ProductID string,
ProductName string,
categoryname string,
OrderDate string,
Freight string,
OrderID string,
ShipperID string,
Quantity string,
Sales string,
Discount string,
COS string,
GP string,
CategoryID string,
oh_Updated_time string,
od_Updated_time string
)
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro';
Below is the query I ran; every column in the result comes back NULL:

select * from prod_order_avro
At the same time, when I read these Avro files using spark-scala as a dataframe and print them, I get the proper result. Below is the Spark code I used to read the data:
val df=hiveContext.read.format("com.databricks.spark.avro").option("header","true").load("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
My question is: while writing these Avro files, do I need to change my Spark code, or do I need to change my Hive table definition, so that the schema is retrieved from the file automatically? I have read that in recent versions Hive takes care of this by itself if the schema is present in the files. Kindly help me out here.
Resolved this: it was a schema issue. The schema was not embedded with the Avro files, so I had to extract it using avro-tools and pass it while creating the table. It is working now.
I followed the steps below:

Extracted a few records from the Avro files stored in HDFS into a file on the local system. Below is the command used:
sudo hdfs dfs -cat /path/file.avro | head --bytes 10K > /path/temp.txt
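A note on why taking only the first 10K works: an Avro object container file stores the writer's schema as JSON in a metadata map at the very start of the file, before any data blocks, so the header survives truncation. The stdlib-only sketch below builds such a header and parses the schema back out (the tiny record schema is illustrative, not the actual table schema):

```python
import io
import json

MAGIC = b"Obj\x01"  # Avro object container file magic bytes

def write_long(n: int) -> bytes:
    """Zigzag + variable-length encode a long, as the Avro spec does."""
    n = (n << 1) ^ (n >> 63)  # zigzag: small magnitudes -> small encodings
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        out.append(b | 0x80 if n else b)  # high bit set = more bytes follow
        if not n:
            return bytes(out)

def read_long(buf: io.BytesIO) -> int:
    """Decode one variable-length zigzag long from the buffer."""
    n, shift = 0, 0
    while True:
        b = buf.read(1)[0]
        n |= (b & 0x7F) << shift
        shift += 7
        if not (b & 0x80):
            return (n >> 1) ^ -(n & 1)  # undo zigzag

def make_header(schema: dict) -> bytes:
    """Build a minimal container-file header embedding the given schema."""
    meta = {"avro.schema": json.dumps(schema).encode(), "avro.codec": b"null"}
    out = bytearray(MAGIC)
    out += write_long(len(meta))
    for key, val in meta.items():
        kb = key.encode()
        out += write_long(len(kb)) + kb
        out += write_long(len(val)) + val
    out += write_long(0)   # end-of-map marker
    out += b"\x00" * 16    # sync marker (random in real files)
    return bytes(out)

def read_schema(data: bytes) -> dict:
    """Parse the embedded writer schema out of a container-file header."""
    buf = io.BytesIO(data)
    assert buf.read(4) == MAGIC, "not an Avro container file"
    meta = {}
    while True:
        count = read_long(buf)
        if count == 0:
            break
        if count < 0:          # negative count: a block byte-size follows
            count = -count
            read_long(buf)
        for _ in range(count):
            key = buf.read(read_long(buf)).decode()
            meta[key] = buf.read(read_long(buf))
    return json.loads(meta["avro.schema"])

# Illustrative schema, not the poster's real fifteen-column table.
schema = {"type": "record", "name": "prod_order",
          "fields": [{"name": "ProductID", "type": "string"}]}
recovered = read_schema(make_header(schema))
print(recovered["name"])  # prod_order
```

If `read_schema` raised a `KeyError` on `avro.schema` for a real file, the file genuinely lacked an embedded schema, which is consistent with the symptom described in this thread.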
Used the avro-tools getschema command to extract the schema from this data:
avro-tools getschema /path/temp.txt
Copied the resulting schema (it will be in the form of JSON data) into a new file with a .avsc extension and uploaded it into HDFS.
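Since the getschema output is plain JSON, it can be sanity-checked with Python's stdlib json module before saving it as the .avsc file; a malformed or truncated schema would otherwise surface later as the same all-NULL symptom. The schema fragment below is illustrative, not the full table schema:

```python
import json

# Illustrative fragment of what avro-tools getschema prints; the real
# output would list all fifteen columns of prod_order_avro.
schema_text = """
{"type": "record",
 "name": "prod_order_avro",
 "fields": [{"name": "ProductID",   "type": "string"},
            {"name": "ProductName", "type": "string"}]}
"""

schema = json.loads(schema_text)   # raises ValueError if the JSON is malformed
assert schema["type"] == "record"  # top level of a table schema must be a record

# Write the validated schema to the local .avsc file to be uploaded to HDFS.
with open("schema.avsc", "w") as out:
    json.dump(schema, out, indent=2)

print([f["name"] for f in schema["fields"]])  # ['ProductID', 'ProductName']
```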
While creating the Hive external table, add the below property to it:
TBLPROPERTIES('avro.schema.url'='hdfs://path/schema.avsc')