
Hive External table on AVRO file producing only NULL data for all columns

I am trying to create a Hive external table on top of some Avro files which were generated using Spark (Scala). I am using CDH 5.16, which has Hive 1.1 and Spark 1.6.

I created the Hive external table, and the create statement ran successfully. But when I query the data, I get NULL for all the columns. My problem is similar to this.

Upon some research, I found out it might be a problem with the schema. But I couldn't find a schema file for these Avro files in that location.

I am pretty new to the Avro file type. Can someone please help me out here?

Below is my Spark code snippet where I saved the file as Avro:

df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
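
For context, on Spark 1.6 the Avro data source ships as a separate package. A typical way to make it available in spark-shell (assuming Scala 2.10 and the Databricks spark-avro 2.0.1 artifact; adjust the versions to match your cluster) would be:

    spark-shell --packages com.databricks:spark-avro_2.10:2.0.1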

Below is my Hive external table create statement:

create external table prod_order_avro
(ProductID string,
ProductName string,
categoryname string,
OrderDate string,
Freight string,
OrderID string,
ShipperID string,
Quantity string,
Sales string,
Discount string,
COS string,
GP string,
CategoryID string,
oh_Updated_time string,
od_Updated_time string
)
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro';

Below is the result I am getting when I query the data with select * from prod_order_avro:

[Screenshot: query result showing NULL in all columns]

At the same time, when I read these Avro files as a DataFrame using Spark (Scala) and print them, I get the proper result. Below is the Spark code I used to read the data:

val df=hiveContext.read.format("com.databricks.spark.avro").option("header","true").load("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
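
As a sanity check before debugging the Hive side, it can help to print the schema Spark infers from the files; if those names or types differ from the Hive DDL, the AvroSerDe can surface NULLs. A minimal sketch using the same hiveContext and df:

    // Inspect the column names/types Spark reads from the Avro files
    df.printSchema()
    // Peek at a few rows to confirm the data itself is intact
    df.show(5)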

[Screenshot: Avro file data when read via Spark (Scala)]

My questions are:

  • While creating these Avro files, do I need to change my Spark
    code to create schema files separately, or will the schema be
    embedded in the files? If it needs to be separate, how do I achieve that?
  • If not, how do I create the Hive table so that the schema is retrieved from the files automatically? I read that in the latest versions Hive takes care of this issue by itself if the schema is present in the files.

Kindly help me out here.

Resolved this.. it was a schema issue. The schema was not embedded with the Avro files, so I had to extract the schema using avro-tools and pass it while creating the table. It is working now.

I followed the below steps:

  1. Extracted a small amount of data from the Avro files stored in HDFS into a file on the local system (the Avro schema lives in the file header, so the first few kilobytes are enough). Below is the command used:

    sudo hdfs dfs -cat /path/file.avro | head --bytes 10K > /path/temp.txt

  2. Used the avro-tools getschema command to extract the schema from this data:

    avro-tools getschema /path/temp.txt
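
    The output is the Avro schema as JSON. As a rough illustration only (topLevelRecord is the spark-avro default record name; the real field list comes from your files), it might look something like:

        {
          "type" : "record",
          "name" : "topLevelRecord",
          "fields" : [
            { "name" : "ProductID",   "type" : [ "string", "null" ] },
            { "name" : "ProductName", "type" : [ "string", "null" ] }
          ]
        }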

  3. Copied the resulting schema (it will be in the form of JSON data) into a new file with the .avsc extension and uploaded it to HDFS.
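
    For example, with placeholder paths:

        hdfs dfs -put /local/path/schema.avsc /user/hive/warehouse/schema.avsc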

  4. While creating the Hive external table, added the below property to it:

    TBLPROPERTIES('avro.schema.url'='hdfs://path/schema.avsc')
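
Putting it together, a sketch of the final create statement (the schema URL is a placeholder; when avro.schema.url is set, the column list can typically be omitted because the AvroSerDe derives the columns from the schema):

    create external table prod_order_avro
    STORED AS AVRO
    LOCATION '/user/hive/warehouse/transform.db/prod_order_avro'
    TBLPROPERTIES('avro.schema.url'='hdfs://path/schema.avsc');

If you prefer to keep the explicit column list, the TBLPROPERTIES clause can simply be appended to the original create statement instead.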
