
Hadoop: querying/reading Avro files

I'm storing data that was converted from complex JSON objects to Avro format.

Each JSON object contains nested objects and arrays of objects. The Avro schema looks like this:

{
    "type" : "record",
    "name" : "userInfo",
    "namespace" : "my.example",
    "fields" : [
        {"name" : "username",
         "type" : "string",
         "default" : "NONE"},

        {"name" : "age",
         "type" : "int",
         "default" : -1},

        {"name" : "phone",
         "type" : "string",
         "default" : "NONE"},

        {"name" : "housenum",
         "type" : "string",
         "default" : "NONE"},

        {"name" : "address",
         "type" : {
             "type" : "record",
             "name" : "mailing_address",
             "fields" : [
                 {"name" : "street",
                  "type" : "string",
                  "default" : "NONE"},

                 {"name" : "city",
                  "type" : "string",
                  "default" : "NONE"},

                 {"name" : "state_prov",
                  "type" : "string",
                  "default" : "NONE"},

                 {"name" : "country",
                  "type" : "string",
                  "default" : "NONE"},

                 {"name" : "zip",
                  "type" : "string",
                  "default" : "NONE"}
             ]
         },
         "default" : {}}
    ]
}

I use NiFi to convert the JSON to Avro and to store the serialized files in Hadoop (currently I just use plain HDFS):

[screenshot of the NiFi flow]

My question:

For test purposes, I would like to query the data stored in HDFS (in Avro format).

At this point I'm a bit confused, because there are so many tools and technologies around Hadoop. What is the right way to do this? Which tools and workflow should I use?

You should be able to create an external Hive table on top of the HDFS location where you wrote the Avro data.
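For example, a minimal DDL sketch using the Hive AvroSerDe. The table name, the LOCATION directory, and the schema file path are assumptions made up for illustration, not taken from the original post; adjust them to wherever NiFi writes the Avro files and wherever you upload the .avsc schema above.

-- Assumed paths: Avro files in /data/avro/userinfo,
-- schema uploaded to hdfs:///schemas/userInfo.avsc
CREATE EXTERNAL TABLE user_info
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/avro/userinfo'
TBLPROPERTIES ('avro.schema.url' = 'hdfs:///schemas/userInfo.avsc');

-- On Hive 0.14+, the shorthand STORED AS AVRO can replace the
-- SERDE/INPUTFORMAT/OUTPUTFORMAT lines above.

Hive derives the table's columns from the Avro schema, so the nested mailing_address record shows up as a struct column.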

These posts have examples:

https://community.hortonworks.com/questions/22135/is-there-a-way-to-create-hive-table-based-on-avro.html

https://cwiki.apache.org/confluence/display/Hive/AvroSerDe
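Once the external table exists, nested record fields can be read with dot notation. A usage sketch against the hypothetical user_info table from above:

-- Query nested fields of the address struct with dot notation
SELECT username, age, address.city, address.country
FROM user_info
WHERE age > 30
LIMIT 10;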
