
SAS group by counters per variable - primary key creation

I have some data that needs to be split into 12 or so different groups. There is no key, and the order of the data is important.

The data contains a number of groups, and those groups can contain single and/or nested groups. Because the data is hierarchical, each group will be split out. Each "GROUP" has its own record layout, and the pieces then all need to be joined back up into one (or many) rows.

Sample data file:

"TRANS","23115168","","","OTVST","","23115168","","COMLT","","",20180216,"OAMI","501928",,
"MTPNT","UPDTE",2415799999,"","","17","","",,20180216,
"ASSET","","REPRT","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","REMVE","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","INSTL","METER","","CR","E6VG470","LPG",2017,"E6S06769699999","","","LI"
"METER","","U","S1",6.0000,"","",20180216,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00000"
"APPNT","",20180216,,"","123900",""

This is the hierarchy that should exist once the data is read in. I am thinking there could be several tables that can be joined together later. (The numbers only illustrate the parent/child levels.)

1. Transaction [TRANS]
   1.1. Meter Point [MTPNT]
      1.1.1. Asset [ASSET]
         1.1.1.1. Meter [METER]
         1.1.1.2. Converter [CONVE]
         1.1.1.3. Register Details [REGST]
            1.1.1.3.1. Reading [READG]
         1.1.1.4. Market Participant [MKPRT]
         1.1.1.5. Name [NAME]
            1.1.1.5.1. Address [ADDRS]
            1.1.1.5.2. Contact Mechanism [CONTM]
   1.2. Appointment [APPNT]
   1.3. Name [NAME]
      1.3.1. Address [ADDRS]
      1.3.2. Contact Mechanism [CONTM]
   1.4. Market Participant [MKPRT]

This is gas industry data, so in this flow you can have many ASSET records per MTPNT, and each ASSET can have many REGST records, because REGST is where the meter reading (READG) is kept.

I have tried BY groups and iterative FIRST. processing, but I have not worked with this type of data before. I need a way to create a key per grouping so that, once the data is split up and the fields are defined, the pieces can be joined back together.

I have tried manipulating the infile so that all the data appears on one line per TRANS, but then I still have the issue of applying the fields, and ordering is paramount.

I have managed to get a few keys for some of the groups, but after splitting they don't quite join back together.

data TRANS;
    set mpancreate_a;
    by DataItmGrp NOTSORTED;

    if first.DataItmGrp then 
        do;
            if DataItmGrp = "TRANS" then 
                TRANSKey+1;
        end;
run;

data TRANS;
    set TRANS;
    TRANSKey2 + 1;
    by DataItmGrp NOTSORTED;

    if first.DataItmGrp then
        do;
            if DataItmGrp = "TRANS" then
                TRANSKEY2=1;
        end;


run;

data MTPNT;
    set TRANS;
    by DataItmGrp NOTSORTED;

    if first.DataItmGrp then
        do;
            if DataItmGrp = "MTPNT" then
                MTPNTKEY+1;
        end;
run;

data MTPNT;
    set MTPNT;
    by  MTPNTKEY NOTSORTED;

    if first.MTPNTKEY  and DataItmGrp = "MTPNT" then
        MTPNTKEY2=0;
    MTPNTKEY2+1;
run;

data ASSET;
    set MTPNT;

    IF MTPNTKEY = 0 THEN
        MTPNTKEY2=0;
    by DataItmGrp NOTSORTED;

    if first.DataItmGrp then
        do;
            if DataItmGrp = "ASSET" then
                ASSETKEY+1;
        end;
run;

data ASSET;
    set ASSET;
    by  ASSETKEY NOTSORTED;

    if first.ASSETKEY  and DataItmGrp = "ASSET" then
        ASSETKEY2=0;
    ASSETKEY2+1;

    IF ASSETKEY =0 THEN
        ASSETKEY2=0;
run;

I want a counter for each group found, and a retained counter for that particular group, but I cannot work out how to get in and out of the groupings based on the hierarchy above.

I'm hoping that once I have these keys, I can split the data by group and then left join it back together. The counters I am after would look something like this:


        _n_     TRANS   TRANS2  MTPNT   MTPNT2
TRANS   1       1       0       0       0
MTPNT   2       2       1       1       1
ASSET   3       3       1       2       1
METER   4       4       1       3       1
READG   5       5       1       4       1
MTPNT   6       6       1       1       2
ASSET   7       7       1       2       2
METER   8       8       1       3       2
READG   9       9       1       4       2
APPNT   10      10      1       5       2
TRANS   11      1       2       6       2
MTPNT   12      2       2       1       3
ASSET   13      3       2       2       3
METER   14      4       2       3       3
READG   15      5       2       4       3
MTPNT   16      6       2       1       4
ASSET   17      7       2       2       4
METER   18      8       2       3       4
READG   19      9       2       4       4
APPNT   20      10      2       5       4   




Reading hierarchical data from a file that has no definitive markers is problematic. The best suggestion I have is to work out which salient values you want to extract and in what context you need them. For this problem, the simplest first approach is a single monolithic table with categorical variables that capture the path descending to the salient value (the meter reading).

In a more complex approach, the first token in each line drives both the input for that line and the output table it belongs to. Since there are no landmarks for absolute or relative position in the hierarchy (NAME and MKPRT, for example, can appear at more than one level), there is no 100% confident way to place such records, and that can also affect the placement of items read from subsequent data lines.

Depending on the true complexity of the real-world data and how well it adheres to the rules, you may or may not miss some values when reading.

Suppose there is the simpler goal of just getting the meter readings.

data want;

length tier level1-level6 $8 path $64 meterReadingString $8 dummy $1;
retain level1-level5 path;
attrib readingdate informat=yymmdd10. format=yymmdd10.;

infile cards dsd missover;

input @1 tier @; * held input - dont advance read line yet;

if tier="TRANS" then do;
  level1 = tier;
  call missing (of level2-level6);
  path = catx("/", of level:);
end;

/* Each tier is accepted when its parent level is currently set.        */
/* Testing the parent level (rather than the exact retained path) lets  */
/* a second ASSET be recognised after the path has reached READG.       */

if tier="MTPNT" and level1="TRANS" then do;
  level2 = tier;
  call missing (of level3-level6);
  path = catx("/", of level:);
end;

if tier="ASSET" and level2="MTPNT" then do;
  level3 = tier;
  call missing (of level4-level6);
  path = catx("/", of level:);
end;

if tier="METER" and level3="ASSET" then do;
  level4 = tier;
  call missing (of level5-level6);
  path = catx("/", of level:);
end;

if tier="REGST" and level4="METER" then do;
  level5 = tier;
  call missing (level6);
  path = catx("/", of level:);
end;

if tier="READG" and level5="REGST" then do;
  level6 = tier;
  path = catx("/", of level:);
  input @1 tier readingdate dummy meterReadingString @; * reread line according to tier;

  meterReading = input(meterReadingString, best12.);

  /* one output row per reading, carrying the full path */
  if path = "TRANS/MTPNT/ASSET/METER/REGST/READG" then OUTPUT;
end;

datalines;
"TRANS","23115168","","","OTVST","","23115168","","COMLT","","",20180216,"OAMI","501928",,
"MTPNT","UPDTE",2415799999,"","","17","","",,20180216,
"ASSET","","REPRT","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","REMVE","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","INSTL","METER","","CR","E6VG470","LPG",2017,"E6S06769699999","","","LI"
"METER","","U","S1",6.0000,"","",20180216,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00000"
"APPNT","",20180216,,"","123900",""
run;

You can use this as the basis of a more complicated reader that writes to a different output data set for each tier (or path to a tier) encountered. You would need a different INPUT statement per tier, similar to how READG is read.
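
As a rough sketch of that idea (not a tested, complete reader), the step below routes each line to its own data set and stamps it with running key counters, so the pieces can be joined back together afterwards. The variable names (trans_id, mtpnt_id, asset_id, regst_id, trans_ref, mtpnt_ref, asset_action, skip) are hypothetical, only the first few fields of each record are read, and the sketch assumes well-formed input where every READG follows a REGST belonging to the current ASSET.

```sas
data trans (keep=trans_id trans_ref)
     mtpnt (keep=trans_id mtpnt_id mtpnt_ref)
     asset (keep=trans_id mtpnt_id asset_id asset_action)
     readg (keep=trans_id mtpnt_id asset_id regst_id readingdate reading);

  length tier $8 trans_ref mtpnt_ref $12 asset_action $8 reading $8 skip $1;
  attrib readingdate informat=yymmdd10. format=yymmdd10.;

  infile datalines dsd missover;
  input tier @;                      /* peek at the record type, hold the line */

  /* Sum statements (x + 1) are retained automatically, so every row   */
  /* carries the keys of the parents most recently seen above it.      */
  select (tier);
    when ("TRANS") do;
      trans_id + 1;
      input @1 tier trans_ref;       /* reread the line with the TRANS layout */
      output trans;
    end;
    when ("MTPNT") do;
      mtpnt_id + 1;
      input @1 tier skip mtpnt_ref;
      output mtpnt;
    end;
    when ("ASSET") do;
      asset_id + 1;
      input @1 tier skip asset_action;
      output asset;
    end;
    when ("REGST") do;
      regst_id + 1;
    end;
    when ("READG") do;
      input @1 tier readingdate skip reading;
      output readg;
    end;
    otherwise;                       /* tiers not handled in this sketch */
  end;
datalines;
"TRANS","23115168","","","OTVST","","23115168","","COMLT","","",20180216,"OAMI","501928",,
"MTPNT","UPDTE",2415799999,"","","17","","",,20180216,
"ASSET","","REPRT","METER","","CR","E6VG470","LPG",2017,"E6S05633099999","","","LI"
"METER","","U","S1",6.0000,"","",20171108,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00990"
"ASSET","","INSTL","METER","","CR","E6VG470","LPG",2017,"E6S06769699999","","","LI"
"METER","","U","S1",6.0000,"","",20180216,"S",,
"REGST","","METER",5,"SCMH",1.000
"READG",20180216,,"00000"
"APPNT","",20180216,,"","123900",""
;
run;

/* Join the split tables back together on the generated keys */
proc sql;
  create table readings as
  select t.trans_ref, m.mtpnt_ref, a.asset_action, r.readingdate, r.reading
  from readg as r
       left join asset as a on r.asset_id = a.asset_id
       left join mtpnt as m on r.mtpnt_id = m.mtpnt_id
       left join trans as t on r.trans_id = t.trans_id;
quit;
```

The counters here never reset, so each key is unique across the whole file rather than within its parent. Tiers that can sit at more than one level (NAME, MKPRT) are deliberately left out, because placing them needs a rule about which parent is currently open.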
