
Hive External table on AVRO file producing only NULL data for all columns

I am trying to create a Hive external table on top of some Avro files which were generated using Spark (Scala). I am using CDH 5.16, which has Hive 1.1 and Spark 1.6.

I created the Hive external table, and the create statement ran successfully. But when I query the data, I get NULL for all the columns. My problem is similar to this.

Upon some research, I found out it might be a problem with the schema. But I couldn't find a schema file for these Avro files in that location.

I am pretty new to the Avro file type. Can someone please help me out here?

Below is my Spark code snippet where I saved the file as Avro:

df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
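
For context, on Spark 1.6 the Avro data source ships as a separate package. A typical way to make it available in spark-shell (assuming Scala 2.10 and the Databricks spark-avro 2.0.1 artifact; adjust the versions to match your cluster) would be:

    spark-shell --packages com.databricks:spark-avro_2.10:2.0.1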

Below is my Hive external table create statement:

create external table prod_order_avro
(ProductID string,
ProductName string,
categoryname string,
OrderDate string,
Freight string,
OrderID string,
ShipperID string,
Quantity string,
Sales string,
Discount string,
COS string,
GP string,
CategoryID string,
oh_Updated_time string,
od_Updated_time string
)
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro';

Below is the result I am getting when I query the data with select * from prod_order_avro:

[Screenshot: query result showing NULL in all columns]

At the same time, when I read these Avro files as a DataFrame using Spark (Scala) and print them, I get the proper result. Below is the Spark code I used to read the data:

val df=hiveContext.read.format("com.databricks.spark.avro").option("header","true").load("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")
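
As a sanity check before debugging the Hive side, it can help to print the schema Spark infers from the files; if those names or types differ from the Hive DDL, the AvroSerDe can surface NULLs. A minimal sketch using the same hiveContext and df:

    // Inspect the column names/types Spark reads from the Avro files
    df.printSchema()
    // Peek at a few rows to confirm the data itself is intact
    df.show(5)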

[Screenshot: Avro file data when read via Spark (Scala)]

My questions are:

  • While creating these Avro files, do I need to change my Spark
    code to create schema files separately, or will the schema be
    embedded in the files? If it needs to be separate, how do I achieve that?
  • If not, how do I create the Hive table so that the schema is retrieved from the files automatically? I read that in the latest versions Hive takes care of this issue by itself if the schema is present in the files.

Kindly help me out here.

Resolved this.. it was a schema issue. The schema was not embedded with the Avro files, so I had to extract the schema using avro-tools and pass it while creating the table. It is working now.

I followed the below steps:

  1. Extracted a small amount of data from the Avro files stored in HDFS into a file on the local system (the Avro schema lives in the file header, so the first few kilobytes are enough). Below is the command used:

    sudo hdfs dfs -cat /path/file.avro | head --bytes 10K > /path/temp.txt

  2. Used the avro-tools getschema command to extract the schema from this data:

    avro-tools getschema /path/temp.txt
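
    The output is the Avro schema as JSON. As a rough illustration only (topLevelRecord is the spark-avro default record name; the real field list comes from your files), it might look something like:

        {
          "type" : "record",
          "name" : "topLevelRecord",
          "fields" : [
            { "name" : "ProductID",   "type" : [ "string", "null" ] },
            { "name" : "ProductName", "type" : [ "string", "null" ] }
          ]
        }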

  3. Copied the resulting schema (it will be in the form of JSON data) into a new file with the .avsc extension and uploaded it to HDFS.
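
    For example, with placeholder paths:

        hdfs dfs -put /local/path/schema.avsc /user/hive/warehouse/schema.avsc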

  4. While creating the Hive external table, added the below property to it:

    TBLPROPERTIES('avro.schema.url'='hdfs://path/schema.avsc')
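
Putting it together, a sketch of the final create statement (the schema URL is a placeholder; when avro.schema.url is set, the column list can typically be omitted because the AvroSerDe derives the columns from the schema):

    create external table prod_order_avro
    STORED AS AVRO
    LOCATION '/user/hive/warehouse/transform.db/prod_order_avro'
    TBLPROPERTIES('avro.schema.url'='hdfs://path/schema.avsc');

If you prefer to keep the explicit column list, the TBLPROPERTIES clause can simply be appended to the original create statement instead.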
