How to load json snappy compressed in HIVE

I have a bunch of json snappy compressed files in HDFS. They are Hadoop snappy compressed (not Python snappy; cf. other SO questions) and have nested structures.

I could not find a method to load them into Hive (using json_tuple).

Can I get some resources/hints on how to load them?

Previous references (these do not have valid answers):

pyspark how to load compressed snappy file

Hive: parsing JSON

  1. Put all files in an HDFS folder and create an external table on top of it. If the files have names ending in .snappy, Hive will automatically recognize them. You can specify SNAPPY output format for writing a table:

-- make query output Snappy-compressed for the rest of the session
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
 


CREATE EXTERNAL TABLE mydirectory_tbl(
  id   string,
  name string
)
ROW FORMAT SERDE
  'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir' -- this is the HDFS/S3 location
;
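For example, with the settings above in effect, writing the data out to another table produces Snappy-compressed files. A minimal sketch, assuming a made-up target table name:

-- hypothetical target table; its files come out Snappy-compressed
-- because of the session settings above
CREATE TABLE mydirectory_copy STORED AS TEXTFILE AS
SELECT id, name FROM mydirectory_tbl;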
  1. JSONSerDe can parse all complex structures; it is much easier than using json_tuple. Simple attributes in the JSON are mapped to columns as-is. Everything in square brackets [] is an array<>, everything in curly braces {} is a struct<> or map<>, and complex types can be nested. Carefully read the Readme: https://github.com/rcongiu/Hive-JSON-Serde . There is a section about nested structures and many examples of CREATE TABLE; see the sketch below.
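For instance, a nested JSON document could map to a table like this. A sketch only; the tags and address fields are illustrative, not from the question:

CREATE EXTERNAL TABLE mydirectory_nested_tbl (
  id      string,
  name    string,
  tags    array<string>,                      -- JSON array  [] -> array<>
  address struct<city:string, zip:string>     -- JSON object {} -> struct<>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir';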

  2. If you still want to use json_tuple, then create a table with a single STRING column and parse it using json_tuple (see the sketch below). But it is much more difficult.
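A minimal sketch of that approach, assuming an illustrative raw table with one JSON document per row:

-- one JSON document per line, stored as plain text
CREATE EXTERNAL TABLE mydirectory_raw (json_line string)
LOCATION '/mydir';

-- extract top-level fields; nested fields need get_json_object or further parsing
SELECT t.id, t.name
FROM mydirectory_raw
LATERAL VIEW json_tuple(json_line, 'id', 'name') t AS id, name;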

  3. All JSON records should be on a single line (no newlines inside JSON objects, and no \r). The same is mentioned here: https://github.com/rcongiu/Hive-JSON-Serde
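For example, a record must sit entirely on one physical line, like this made-up sample:

{"id":"1","name":"alice","tags":["a","b"],"address":{"city":"x","zip":"y"}}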

If your data is partitioned (e.g. by date):

Create the table in Hive:

CREATE EXTERNAL TABLE IF NOT EXISTS database.table (
  filename STRING,
  cnt BIGINT,
  size DOUBLE
) PARTITIONED BY (`date` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'folder/path/in/hdfs';
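For MSCK REPAIR to discover the partitions, the directories under the table location must follow the key=value naming convention, e.g. folder/path/in/hdfs/date=2020-01-01/. Alternatively, a single partition can be added by hand; a sketch, with a made-up date value:

ALTER TABLE database.table ADD IF NOT EXISTS
PARTITION (`date`='2020-01-01')
LOCATION 'folder/path/in/hdfs/date=2020-01-01';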

Recover the partitions (before the recovery, the table appears to be empty):

MSCK REPAIR TABLE database.table;
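To check that the partitions were registered, a quick sanity check (not part of the original answer):

SHOW PARTITIONS database.table;
SELECT * FROM database.table LIMIT 10;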
