
How to transform pseudo-XML into a flat structure?


I'm trying to parse a file that looks like XML but is not. Actually, it is a readable version of a CRD transformed from ASN.1 format. It looks like this:

<PIN rowNum="1">
<CgPa tag="3100.2.960.51" value="1">
<data tag="3100.2.962.56" name="cgPASubscriberIdentifier" value="50212000000089804"/>
<data tag="3100.2.962.60" name="cgPaRoaming" value="1"/>
</CgPa>
<AAA_Common tag="3100.2.960.1" value="1">
<data tag="3100.2.962.12" name="sigSleeId" value="watbf102"/>
<data tag="3100.2.962.34" name="scpAddress" value="48602888950"/>
</AAA_Common>
<evt tag="3100.2.134.28" name="unsupported" value="0"/>
<data tag="3100.2.112.1" name="eventDateTime" value="07/05/2014 19:45:18"/>
<data tag="3100.2.137.4" name="inTriggeringKey" value="0048662221827"/>
<evt tag="3100.2.137.5" name="typeINTriggeringKey" value="1"/>
<CustomerDomain tag="3100.2.134.1" value="1">
<data tag="3100.2.133.1" name="ordinaryClientId" value="50212000000089804"/>
<data tag="3100.2.105.1" name="customerServiceName" value="SO_TT_Roam_Voice"/>
<AccountDomain tag="3100.2.134.3" value="1">
<data tag="3100.2.104.4" name="accountIdentifier" value="50212000000089804"/>
<data tag="3100.2.100.1" name="subscriberType" value="1"/>
<evt tag="3100.2.139.3" name="unsupported" value="0"/>
<TariffDomain tag="3100.2.134.11" value="1">
<data tag="3100.2.106.10" name="tariffPlanNameVersion" value="TT_VOI_R_1_PL_1A_0_RoamB - 2_TCA"/>
</TariffDomain>
<TariffDomain tag="3100.2.134.11" value="1">
<data tag="3100.2.106.10" name="tariffPlanNameVersion" value="TT_VOI_R_1_PL_1A_0_Main - 2_TCA"/>
<data tag="3100.2.106.1" name="tariffPlanName" value="TT_VOI_R_1_PL_1A_0_Main"/>
<evt tag="3100.2.139.9" name="tariffCost" value="1013"/>
<evt tag="3100.2.139.10" name="tariffCostVat" value="1013"/>
<evt tag="3100.2.140.7" name="eventQuantityPerTariff1" value="614"/>
<evt tag="3100.2.142.11" name="usedQuantityPerTariff1" value="614"/>
</TariffDomain>
<evt tag="3100.2.134.29" name="unsupported" value="1"/>
<data tag="3100.2.124.45" name="unsupported" value="07/05/2014 19:45:18"/>
<evt tag="3100.2.139.35" name="unsupported" value="495"/>
<data tag="3100.2.24.11" name="unsupported" value="84490"/>
<evt tag="3100.2.134.30" name="unsupported" value="1"/>
</AccountDomain>
</CustomerDomain>
</PIN>

The main tag for each record is PIN, but the sub-tags can appear in any order or may not appear at all. The typical solution for XML in Pig is the piggybank function XMLLoader, but it assumes that the order of tags is constant; otherwise we are unable to put the data into a schema. The only solution I see is to apply a REGEXP to each line, take the name and value, and store them in a map[]. But what about tags that appear more than once, like TariffDomain in my example? How do I deal with them?

Regards
Pawel

Here is one idea; please let me know if it works for you.
Algorithm:
1. Parse each line and extract the name and value with a regular expression.
2. Filter out the lines that did not match (null results).
3. Group the rows by key.
4. Map each key to a bag holding all of its values.
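
For readers who want to try the four steps outside Pig first, here is a minimal Python sketch of the same logic. It is only an illustration, not part of the Pig solution: the names `PAIR` and `flatten` are made up for this example, and the sample lines are copied from the question.

```python
import re
from collections import defaultdict

# [^"]* cannot run past the closing quote, so each group holds one attribute.
PAIR = re.compile(r'name="([^"]*)"\s+value="([^"]*)"')

def flatten(lines):
    """Group every value found under the same name into one list."""
    grouped = defaultdict(list)
    for line in lines:
        match = PAIR.search(line)      # step 1: extract name and value
        if match is None:              # step 2: skip lines without a pair
            continue
        name, value = match.groups()
        grouped[name].append(value)    # steps 3-4: collect values per key
    return dict(grouped)

# Sample lines taken from the question; container tags have no name attribute.
sample = [
    '<PIN rowNum="1">',
    '<TariffDomain tag="3100.2.134.11" value="1">',
    '<data tag="3100.2.106.10" name="tariffPlanNameVersion" value="TT_VOI_R_1_PL_1A_0_RoamB - 2_TCA"/>',
    '</TariffDomain>',
    '<TariffDomain tag="3100.2.134.11" value="1">',
    '<data tag="3100.2.106.10" name="tariffPlanNameVersion" value="TT_VOI_R_1_PL_1A_0_Main - 2_TCA"/>',
    '<evt tag="3100.2.139.9" name="tariffCost" value="1013"/>',
    '</TariffDomain>',
    '</PIN>',
]

result = flatten(sample)
print(result)
```

A key that occurs in several TariffDomain blocks simply ends up with several entries in its list, which is the same role the bag plays in Pig.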

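PigScript:
-- Load each line of the file as a single string.
A = LOAD 'input.txt' AS (line:chararray);
-- Extract name and value; [^"]* stops at the closing quote of each attribute.
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line, '.*name="([^"]*)"\\s+value="([^"]*)".*')) AS (mykey:chararray, myvalue:chararray);
-- Drop the lines that did not match (no name/value pair).
C = FILTER B BY mykey IS NOT NULL;
-- Collect all values that share a key into one bag.
D = GROUP C BY mykey;
E = FOREACH D GENERATE TOMAP(group, C.myvalue);
DUMP E;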

Output:
([sigSleeId#{(watbf102)}])  
([scpAddress#{(48602888950)}])  
([tariffCost#{(1013)}])  
([cgPaRoaming#{(1)}])  
([unsupported#{(1),(0),(0),(1),(07/05/2014 19:45:18),(495),(84490)}])  
([eventDateTime#{(07/05/2014 19:45:18)}])  
([tariffCostVat#{(1013)}])  
([subscriberType#{(1)}])  
([tariffPlanName#{(TT_VOI_R_1_PL_1A_0_Main)}])  
([inTriggeringKey#{(0048662221827)}])  
([ordinaryClientId#{(50212000000089804)}])  
([accountIdentifier#{(50212000000089804)}])  
([customerServiceName#{(SO_TT_Roam_Voice)}])  
([typeINTriggeringKey#{(1)}])  
([tariffPlanNameVersion#{(TT_VOI_R_1_PL_1A_0_Main - 2_TCA),(TT_VOI_R_1_PL_1A_0_RoamB - 2_TCA)}])  
([usedQuantityPerTariff1#{(614)}])  
([eventQuantityPerTariff1#{(614)}])  
([cgPASubscriberIdentifier#{(50212000000089804)}]) 
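
Note that the output above groups keys across the whole input. If the file holds several PIN records and each one should get its own map, the grouping has to be restarted at every PIN boundary. The following Python sketch shows one possible way to do that; it is an assumption about what is wanted, not something taken from the Pig script. The function name `flatten_records` is hypothetical, and the second record (rowNum="2") is invented purely for illustration.

```python
import re
from collections import defaultdict

PAIR = re.compile(r'name="([^"]*)"\s+value="([^"]*)"')
PIN_OPEN = re.compile(r'<PIN\s+rowNum="([^"]*)"')

def flatten_records(lines):
    """Yield (rowNum, {name: [values]}) for each <PIN> ... </PIN> block."""
    row_num, fields = None, None
    for line in lines:
        line = line.strip()
        opened = PIN_OPEN.match(line)
        if opened:
            # A new record starts: begin a fresh map for it.
            row_num, fields = opened.group(1), defaultdict(list)
        elif line == "</PIN>":
            # The record ends: emit its map and reset the state.
            if fields is not None:
                yield row_num, dict(fields)
            row_num, fields = None, None
        elif fields is not None:
            pair = PAIR.search(line)
            if pair:
                fields[pair.group(1)].append(pair.group(2))

# The first record uses lines from the question; the second is made up.
sample = """\
<PIN rowNum="1">
<data tag="3100.2.962.60" name="cgPaRoaming" value="1"/>
<evt tag="3100.2.134.28" name="unsupported" value="0"/>
<evt tag="3100.2.134.29" name="unsupported" value="1"/>
</PIN>
<PIN rowNum="2">
<data tag="3100.2.962.60" name="cgPaRoaming" value="0"/>
</PIN>
""".splitlines()

records = list(flatten_records(sample))
for row, fields in records:
    print(row, fields)
```

Each record keeps its own lists, so repeated names such as "unsupported" stay attached to the PIN they came from instead of being merged across records.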
