
Is it possible to compress a Parquet file which contains JSON data in a Hive external table?

I want to know how to compress a Parquet file that contains JSON data in a Hive external table. How can this be done?

I have created an external table like this:

create table parquet_table_name3 (id BIGINT, created_at STRING, source STRING, favorited BOOLEAN)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
LOCATION '/user/cloudera/parquet2';

and I have set the compression property:

set parquet.compression=GZIP;

and compressed my input Parquet file by executing:

gzip <file name>   (i.e. 000000_0.Parquet)

After that I loaded the compressed GZIP file into the HDFS location /user/cloudera/parquet2.

Next I tried to run the query below:

select * from parquet_table_name3;

I am getting the result below:

NULL    NULL    NULL    NULL
NULL    NULL    NULL    NULL

Can you please let me know why I am getting null values instead of the actual data, and how to do Parquet file compression (if it contains JSON data) in a Hive external table? Can someone help me compress data in a Hive external table?

Duh! You can't compress an existing Parquet file "from outside". It's a columnar format with a hellishly complicated internal structure, just like ORC; the file "skeleton" requires fast random access (i.e. no compression), and each data chunk has to be compressed separately because they are accessed separately.

It's when you create a new Parquet file that you request the SerDe library to compress data inside the file, based on the parquet.compression Hive property.
At read time, the SerDe then checks the compression codec of each data file and decompresses accordingly.
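
For illustration, here is a minimal sketch of that write path, assuming the raw JSON has already been loaded into a staging Hive table (hypothetically named tweets_json here, with matching columns); the table name and HDFS path are only placeholders:

set parquet.compression=GZIP;    -- codec applied to Parquet files written from this session

-- target table backed by the HDFS location from the question
create external table parquet_table_gzip (
  id BIGINT,
  created_at STRING,
  source STRING,
  favorited BOOLEAN
)
stored as parquet
location '/user/cloudera/parquet2';

-- writing through Hive produces Parquet files whose data pages are GZIP-compressed internally
insert overwrite table parquet_table_gzip
select id, created_at, source, favorited
from tweets_json;    -- hypothetical staging table holding the raw JSON data

On older Hive/CDH versions that do not support STORED AS PARQUET you would declare the table with ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe' as in the question; either way, the point is that compression happens while Hive writes the file, not afterwards.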

A quick Google search returns a couple of must-reads, such as this and that.
