How to load json snappy compressed in HIVE
I have a bunch of JSON Snappy-compressed files in HDFS. They are Hadoop-snappy compressed (not python-snappy, cf. other SO questions) and have nested structures.
I could not find a method to load them into Hive (using json_tuple)?
Can I get some resources/hints on how to load them?
Previous references (without valid answers):
pyspark how to load compressed snappy file
-- Enable Snappy-compressed output for Hive/MapReduce jobs
set hive.exec.compress.output=true;
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
set mapreduce.output.fileoutputformat.compress.type=BLOCK;
CREATE EXTERNAL TABLE mydirectory_tbl (
  id string,
  name string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir'; -- this is an HDFS/S3 location
JsonSerDe can parse all complex structures; it is much easier than using json_tuple. Simple attributes in the JSON are mapped to columns as-is. Everything in square brackets [] becomes an array<>, everything in {} becomes a struct<> or map<>, and complex types can be nested. Carefully read the README: https://github.com/rcongiu/Hive-JSON-Serde . There is a section about nested structures and many examples of CREATE TABLE.
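As a minimal sketch of that mapping (the column names and the sample JSON shape below are made up for illustration, not taken from the question's data):

```sql
-- Hypothetical input, one object per line:
-- {"id":"1","tags":["a","b"],"address":{"city":"NYC","zip":"10001"}}
CREATE EXTERNAL TABLE mydirectory_nested (
  id      string,
  tags    array<string>,                    -- [] in JSON -> array<>
  address struct<city:string, zip:string>  -- {} in JSON -> struct<>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '/mydir';

-- Nested fields are addressed with . and [] syntax
SELECT id, tags[0], address.city FROM mydirectory_nested;
```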
If you still want to use json_tuple, create the table with a single STRING column and then parse it using json_tuple. But it is much more difficult.
All JSON records should be on a single line (no newlines inside JSON objects, and no \r characters). The same is mentioned here: https://github.com/rcongiu/Hive-JSON-Serde
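A minimal sketch of the json_tuple route, assuming the files contain one JSON object per line with top-level id and name fields (the table and column names here are illustrative):

```sql
-- Raw table: each line of the (snappy-compressed) file is one STRING
CREATE EXTERNAL TABLE mydirectory_raw (json_line STRING)
LOCATION '/mydir';

-- Parse the top-level fields at query time
SELECT t.id, t.name
FROM mydirectory_raw r
LATERAL VIEW json_tuple(r.json_line, 'id', 'name') t AS id, name;
```

Note that json_tuple only extracts top-level keys; reaching nested fields requires get_json_object with a JSON path, or chaining another LATERAL VIEW json_tuple over the extracted sub-object.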
If your data is partitioned (e.g. by date):
CREATE EXTERNAL TABLE IF NOT EXISTS database.table (
  filename STRING,
  cnt BIGINT,
  size DOUBLE
)
PARTITIONED BY (`date` STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION 'folder/path/in/hdfs';
MSCK REPAIR TABLE database.table;
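If you already know which partitions exist, an alternative to scanning the whole directory tree with MSCK REPAIR is to register each partition explicitly (the date value and path below are illustrative):

```sql
-- Register one partition without a full metastore repair scan
ALTER TABLE database.table ADD IF NOT EXISTS
PARTITION (`date`='2020-01-01')
LOCATION 'folder/path/in/hdfs/date=2020-01-01';
```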