
Parse a complex JSON string contained in Hadoop

I want to parse a complex JSON string in Pig. Specifically, I want Pig to treat my JSON array as a bag rather than as a single chararray. I found that complex JSON can be parsed with Twitter's Elephant Bird or Mozilla's Akela library. (I found some other libraries too, but I cannot use a Loader-based approach because I load the data from Hive with the HCatalog loader.)

The problem is the structure of my data: each value in the map holds a piece of complex JSON encoded as a string. For example:

1. My table looks like this (WARNING: the type of 'complex_data' is not STRING but MAP<STRING, STRING>!)
TABLE temp_table
(
    user_id BIGINT COMMENT 'user ID.',
    complex_data MAP <STRING, STRING> COMMENT 'complex json data'
)
COMMENT 'temp data.'
PARTITIONED BY(created_date STRING)
STORED AS RCFILE;


2. 'complex_data' contains the following (the values I want are marked with two *s; basically #'d'#'f' from each element of PARSED_STRING(complex_data#'c'))
{ "a": "[]", 
  "b": "\"sdf\"", 
  "**c**":"[{\"**d**\":{\"e\":\"sdfsdf\"
                      ,\"**f**\":\"sdfs\"
                      ,\"g\":\"qweqweqwe\"},
             \"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
                   {\"d\":21321,\"e\":\"ewrwer\"},
                   {\"d\":21321,\"e\":\"ewrwer\"}]
            },
            {\"**d**\":{\"e\":\"sdfsdf\"
                      ,\"**f**\":\"sdfs\"
                      ,\"g\":\"qweqweqwe\"},
             \"c\":[{\"d\":21321,\"e\":\"ewrwer\"},
                   {\"d\":21321,\"e\":\"ewrwer\"},
                   {\"d\":21321,\"e\":\"ewrwer\"}]
            },]"
}

3. So, I tried... (same approach for Elephant Bird)

REGISTER '/path/to/akela-0.6-SNAPSHOT.jar';
DEFINE JsonTupleMap com.mozilla.pig.eval.json.JsonTupleMap();

data = LOAD 'temp_table' USING org.apache.hive.hcatalog.pig.HCatLoader();
values_of_map = FOREACH data GENERATE complex_data#'c' AS attr:chararray;    -- IT WORKS

-- dump values_of_map shows correct chararray data per each row
-- eg) ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }])
--     ([{"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... },
--       {"d":{"e":"sdfsdf","f":"sdfs","g":"sdf"},... }]) ...

attempt1 = FOREACH data GENERATE JsonTupleMap(complex_data#'c');   -- THIS LINE CAUSES AN ERROR
attempt2 = FOREACH data GENERATE JsonTupleMap(CONCAT(CONCAT('{\\"key\\":', complex_data#'c'), '}'));   -- IT ALSO DOES NOT WORK

I guessed that "attempt1" failed because the value does not contain a full JSON document. However, when I CONCAT as in "attempt2", extra \\ marks are generated (so each line starts with {\\"key\\": ). I'm not sure whether these extra marks break the parsing rules. In any case, I want to parse the given JSON string into something Pig can understand. If you know a method or solution, please let me know.
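Whether the extra backslashes matter can be checked outside Pig. The following is a small sketch in plain Python, assuming a simplified stand-in for the value of complex_data#'c'; the helper name parses is made up for illustration. It only shows how a strict JSON parser reacts to the two wrapped strings. It does not show how JsonTupleMap itself behaves.

```python
import json

# Simplified stand-in for the chararray returned by complex_data#'c' (assumption).
value = '[{"d": {"e": "sdfsdf", "f": "sdfs", "g": "qweqweqwe"}}]'

def parses(text):
    """Return True if text is accepted by a strict JSON parser."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

# Wrapping with plain double quotes gives a JSON object with one key.
wrapped_plain = '{"key":' + value + '}'

# Wrapping with backslash-escaped quotes: the text begins with {\"key\":
# and a backslash is not allowed outside a JSON string.
wrapped_escaped = '{\\"key\\":' + value + '}'

print(parses(wrapped_plain))
print(parses(wrapped_escaped))
```

If the second form is rejected, the quotes in the Pig string literal probably should not be escaped at all, since a single-quoted literal can contain double quotes directly.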

I finally solved my problem by using the jyson library in a Jython UDF. I know it could also be solved in Java or other languages, but I think Jython with jyson is the simplest answer to this issue.
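The answer does not include its code, so the following is only a sketch of what such a UDF might look like, not the author's actual script. The function name extract_d_f, the output schema, and the file name in the comments are all hypothetical. The standard json module is used as a stand-in for jyson so the sketch runs under any Python. Under a Jython version that lacks json, the jyson codec would take its place.

```python
import json  # stand-in for jyson (assumption); swap in the jyson codec under old Jython

# Pig supplies the outputSchema decorator to registered Jython scripts.
# This fallback only exists so the file can also be run outside Pig.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator

@outputSchema("f_values:bag{t:tuple(f:chararray)}")
def extract_d_f(json_text):
    """Parse a JSON array string and return the d.f value of each element
    as a list of one-field tuples (a bag on the Pig side)."""
    if json_text is None:
        return None
    try:
        items = json.loads(json_text)
    except ValueError:
        return None  # malformed JSON becomes a null instead of failing the job
    if not isinstance(items, list):
        return None
    result = []
    for item in items:
        d = item.get("d") if isinstance(item, dict) else None
        if isinstance(d, dict) and d.get("f") is not None:
            result.append((d["f"],))
    return result

# Hypothetical use from Pig, assuming this file is saved as json_udf.py:
#   REGISTER 'json_udf.py' USING jython AS json_udf;
#   f_values = FOREACH data GENERATE user_id, json_udf.extract_d_f(complex_data#'c');
```

A strict parser rejects a trailing comma before the closing bracket, like the one in the sample data above, so real input in that shape would come back as null from this sketch.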


 