
hadoop - Validate JSON data loaded into Hive warehouse

I have JSON files with a total volume of approximately 500 TB. I have loaded the complete set into a Hive data warehouse.

How would I validate or test the data that was loaded into the Hive warehouse? What should my testing strategy be?

The client wants us to validate the JSON data: is the data loaded into Hive correct or not? Is anything missing? If so, which field?

Please help.

How is your data being stored in the Hive tables?

One option is to create a Hive UDF that receives the JSON string, validates the data, and returns a string: an error message if something is wrong, or an empty string if the JSON string is well formed.

Here is a Hive UDF tutorial: http://blog.matthewrathbone.com/2013/08/10/guide-to-writing-hive-udfs.html

With the Hive UDF in place you can execute queries like:

select strjson, validateJson(strjson) from jsonTable where validateJson(strjson) != "";
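The validation contract described above (empty string for a good record, error message otherwise) can be sketched in a few lines. This is only an illustration, written in Python rather than as a Java UDF; the function names, the `REQUIRED_FIELDS` tuple, and the field names in it are assumptions, not something from the question. It also covers the "which field is missing" part by naming absent fields in the message.

```python
import json

# Hypothetical list of fields every record is expected to carry.
# Replace with the fields your client actually requires.
REQUIRED_FIELDS = ("id", "timestamp")


def validate_json(strjson, required_fields=REQUIRED_FIELDS):
    """Return "" if strjson is a well-formed JSON object containing all
    required fields, otherwise return a short error message."""
    try:
        record = json.loads(strjson)
    except (TypeError, ValueError) as exc:
        # json.JSONDecodeError is a subclass of ValueError;
        # TypeError covers a None input.
        return "malformed JSON: {}".format(exc)
    if not isinstance(record, dict):
        return "not a JSON object"
    missing = [name for name in required_fields if name not in record]
    if missing:
        return "missing fields: " + ",".join(missing)
    return ""


def invalid_rows(lines):
    """Yield 'json<TAB>error' for every input line that fails validation.

    Assumes one JSON document per line and no literal tab characters
    inside the documents, since the tab is used as the column separator.
    """
    for line in lines:
        strjson = line.rstrip("\n")
        error = validate_json(strjson)
        if error:
            yield strjson + "\t" + error
```

If you would rather not write Java, one possible way to run logic like this is Hive's streaming `TRANSFORM` clause: add a loop at the bottom of the script that prints each result of `invalid_rows(sys.stdin)`, register the file with `ADD FILE`, and select with `TRANSFORM (strjson) USING 'python validate_json.py' AS (strjson, error)`. The script name here is hypothetical, and at 500 TB you would want to try it on a sample partition first.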
