Guide on how to process an AST formatted as a JSON-like structure

Question

Using SQL::Abstract::Tree in Perl, I am able to generate an AST for an SQL query with:

my $sqlat = SQL::Abstract::Tree->new;
my $tree = $sqlat->parse($query_str);

where $query_str is an SQL query.

As an example, the query string SELECT cust_id, a as A, z SUM(price) as q, from orders WHERE status > 55 produces:

[
  [
    "SELECT",
    [
      [
        "-LIST",
        [
          ["-LITERAL", ["cust_id"]],
          ["AS", [["-LITERAL", ["a"]], ["-LITERAL", ["A"]]]],
          [
            "AS",
            [
              ["-LITERAL", ["z"]],
              ["SUM", [["-PAREN", [["-LITERAL", ["price"]]]]]],
              ["-LITERAL", ["q"]],
            ],
          ],
          [],
        ],
      ],
    ],
  ],
  ["FROM", [["-LITERAL", ["orders"]]]],
  [
    "WHERE",
    [[">", [["-LITERAL", ["status"]], ["-LITERAL", [55]]]]],
  ],
]

I would like to walk the AST and derive certain information about it.

I would like to know if there is a guide/tutorial/example source code that walks an AST in this type of format.

Most of the literature I have found on walking ASTs assumes that I have some sort of class hierarchy, and describes some variation of the visitor pattern for walking it.

My specific use case is converting simple SQL queries to Mongo queries for the aggregation framework, with some examples given here.

Here is what I have been doing so far:

I first call a parse function with the tree. It dispatches on each subtree according to its type (the first element of each subtree) and calls the matching handler with the rest of the subtree. Here is my parse function:

sub parse {
    my ($tree) = @_;

    my %results = (ret => []);
    for my $subtree (@$tree) {
        my ($node_type, $node) = @$subtree;

        my $result_dic = $dispatch{$node_type}->($node);
        if ($result_dic->{type}) {
            my $type = $result_dic->{type};
            $results{$type} = [] unless $results{$type};
            push @{ $results{$type} }, $result_dic->{ret};
            %results = merge_except_for($result_dic, \%results, 'ret', $type);
        }
        else {
            push @{$results{ret}}, @{$result_dic->{ret}};
        }
    }

    return \%results;
}

Which uses the following dispatch table:

my %dispatch = (
    SELECT => sub {

        my $node = shift;
        my $result_dic = parse($node);
        $result_dic->{type} = 'select';
        if ($result_dic->{as}) {
            push @{ $result_dic->{ret} }, $result_dic->{as}->[0][0];
        }
        return $result_dic;
    },
    '-LITERAL' => sub {
        my $node = shift;
        my $literal = $node;
        return {ret => $node};
    },
    '-LIST' => sub {
        my $node = shift;
        my $result_dic = parse($node);

        my $ret = flatten_ret($result_dic);

        return flatten_ret($result_dic);
    },
    WHERE => sub {
        my $tree = shift;
        my @bin_ops = qw/= <= < >= >/;

        my $op = $tree->[0];
        if ($op ~~ @bin_ops) {
            # Not yet implemented
        }
        return {ret => ''};

    },
    FROM => sub {
        my $tree = shift;
        my $parse_result = parse($tree);
        return {ret => $parse_result->{ret},
                type => 'database'};
    },
    AS => sub {
        my $node = shift;

        my $result_dic = parse($node);
        $result_dic->{type} = 'as';
        return $result_dic;
    }
);

sub flatten_ret {
    my $result_dic = shift;

    return {ret => [
        map {
            ref($_) ? $_->[0] : $_
        } @{$result_dic->{ret}}]};
}

But I'm not sure about certain things, such as whether I should be checking if the node name is "AS" in the SELECT subroutine, or finding a way to recurse to fill in the data.

Also, what type of data should be returned from each dispatch call and how can I combine it at the end?

Also, I am new to AST processing and looking to get a grip on it, so advice on how I could improve my question would also be appreciated.

Answer 1

Your idea of doing typed dispatch is roughly correct. Usually one would use objects and dispatch methods on them, but using a two-element list to tag data with a type works as well. Your misnamed parse function implements this dispatch and somehow aggregates the output. I am not quite sure what you are trying to achieve with that.
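To make the tagged-list idea concrete, here is a minimal sketch of a walker that dispatches on the first element of each node. The handler table, the '*' fallback key, and the walk function are hypothetical names chosen for illustration; the sketch only assumes that every node has the shape [ $type, \@children ], as in the dump in the question.

```perl
use strict;
use warnings;

# Each node is assumed to be a two-element list: [ $type, \@children ].
# The handlers are illustrative; together they collect every literal value.
my %handler = (
    # A literal's "children" are its plain values, so return them as-is.
    '-LITERAL' => sub { my ($children) = @_; return @$children },
    # Fallback: visit every child node and concatenate the results.
    '*'        => sub { my ($children) = @_; return map { walk($_) } @$children },
);

sub walk {
    my ($node) = @_;
    # Skip empty nodes such as the [] left behind by a trailing comma.
    return unless ref $node eq 'ARRAY' && @$node;
    my ($type, $children) = @$node;
    my $code = $handler{$type} // $handler{'*'};
    return $code->($children);
}

my $ast   = [ 'AS', [ [ '-LITERAL', ['a'] ], [ '-LITERAL', ['A'] ] ] ];
my @names = walk($ast);    # ('a', 'A')
```

Adding a handler for a new node type is then a matter of adding one key to the table, which is the same role a visitor method plays in a class hierarchy.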

When doing AST transforms, it is very useful to keep in mind exactly what output you want to create. Let's assume you want to transform

SELECT cust_id, a as A, SUM(price) as q from orders WHERE status > 55

into the data structure

{
  table  => 'orders',
  action => 'aggregate',
  query  => [
    '$match' => { 'status' => { '$gt' => 55 } },
    '$group' => {
       '_id'     => undef,
       'cust_id' => '$cust_id',
       'A'       => '$a',
       'q'       => { '$sum' => '$price' },
    },
  ],
}

What do we have to do for that?

1. Assert that we have a SELECT ... FROM ... type query.
2. Set the action to aggregate.
3. Extract the table name from the FROM entry.
4. Assemble the query:
   - For each SELECT item, get the name and the expression that produces its value.
     - Build each expression recursively.
   - If a WHERE clause is present, translate each condition recursively.

If we encounter syntax which we cannot parse, throw an error.
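The WHERE step could be sketched roughly as follows. This is an illustration, not part of the code further down: the %mongo_op table and the translate_condition name are hypothetical, and the sketch assumes that comparison nodes look like the '>' node in the question's dump and that conjunctions arrive as an 'AND' node holding a list of operands.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical mapping from SQL comparison operators to MongoDB query operators.
my %mongo_op = (
    '='  => '$eq',
    '<'  => '$lt',
    '<=' => '$lte',
    '>'  => '$gt',
    '>=' => '$gte',
);

# Translate one condition node, e.g.
#   [ '>', [ [ '-LITERAL', ['status'] ], [ '-LITERAL', [55] ] ] ]
# Only "column OP constant" comparisons and AND are handled in this sketch.
sub translate_condition {
    my ($node) = @_;
    my ($type, $args) = @$node;

    if ($type eq 'AND') {
        # Recurse into every operand and combine the results.
        return { '$and' => [ map { translate_condition($_) } @$args ] };
    }
    if (my $op = $mongo_op{$type}) {
        my ($lhs, $rhs) = @$args;
        for my $operand ($lhs, $rhs) {
            $operand->[0] eq '-LITERAL'
                or croak "Operands of '$type' must be literals, found $operand->[0]";
        }
        return { $lhs->[1][0] => { $op => $rhs->[1][0] } };
    }
    # Anything else is syntax we cannot translate, so fail loudly.
    croak "Cannot translate condition of type '$type'";
}

my $match = translate_condition(
    [ '>', [ [ '-LITERAL', ['status'] ], [ '-LITERAL', [55] ] ] ]
);
# $match is now { status => { '$gt' => 55 } }
```

Unknown node types end in croak rather than being skipped silently, which matches the rule of throwing an error on syntax we cannot parse.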

Note that my approach starts from the top, and extracts information from deeper in the AST when we need it. This is in contrast to your bottom-up approach that munges all data together and hopes something relevant remains at the end. Especially your hash merging looks dubious.

How can this be implemented? Here is a start:

use strict;
use warnings;
use Carp;

sub translate_select_statement {
  my ($select, $from, @other_clauses) = @_;
  $select->[0] eq 'SELECT'
    or croak "First clause must be a SELECT clause, not $select->[0]";
  $from->[0] eq 'FROM'
    or croak "Second clause must be a FROM clause, not $from->[0]";

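  # The SELECT node's children form a one-element list holding either
  # a -LIST node or a single item, so unwrap that outer list first.
  my ($select_list) = @{ $select->[1] };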
  my %groups = (
    _id => undef,
    translate_select_list(get_list_items($select_list)),
  );

  ...
}

sub get_list_items {
  my ($list) = @_;
  if ($list->[0] eq '-LIST') {
    return @{ $list->[1] };
  }
  else {
    # so it's probably just a single item
    return $list;
  }
}

sub translate_select_list {
  my %out;
  for my $item (@_) {
    my ($type, $data) = @$item;
    if ($type eq '-LITERAL') {
      my ($name) = @$data;
      $out{$name} = '$' . $name;
    }
    elsif ($type eq 'AS') {
      my ($expr, $name_literal) = @$data;
      $name_literal->[0] eq '-LITERAL'
        or croak "in 'x AS y' expression, y must be a literal, but it was $name_literal->[0]";
      $out{$name_literal->[1][0]} = translate_expression($expr);
    }
    else {
      croak "In select list, items must be literals or 'x AS y' expressions. Found [$type, $data] instead.";
    }
  }
  return %out;
}

sub translate_expression { ... }

The way I structured this, it is much more like a top-down parser, but for other parts, e.g. the translation of arithmetic expressions, type dispatch is more important. In the above code, if/else cases are better because they allow for more validation.
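As an illustration of where type dispatch fits, the translate_expression stub could be filled in along these lines. This is only a sketch: the %expression table is a hypothetical name, it covers just the three node types that appear in the example query, and it assumes that SUM(price) parses to a SUM node wrapping a -PAREN node, as in the dump in the question.

```perl
use strict;
use warnings;
use Carp;

# Hypothetical dispatch table for expression nodes; keys are node types.
my %expression = (
    # A bare column name becomes a field reference: price -> '$price'.
    # Numeric constants would need separate handling.
    '-LITERAL' => sub { my ($args) = @_; return '$' . $args->[0] },
    # Parentheses around a single expression are transparent.
    '-PAREN'   => sub { my ($args) = @_; return translate_expression($args->[0]) },
    # SUM(expr) maps to the $sum accumulator.
    'SUM'      => sub { my ($args) = @_; return { '$sum' => translate_expression($args->[0]) } },
);

sub translate_expression {
    my ($node) = @_;
    my ($type, $args) = @$node;
    my $code = $expression{$type}
        or croak "Cannot translate expression of type '$type'";
    return $code->($args);
}

my $sum = translate_expression(
    [ 'SUM', [ [ '-PAREN', [ [ '-LITERAL', ['price'] ] ] ] ] ]
);
# $sum is now { '$sum' => '$price' }
```

Supporting another function or operator then means adding one entry to the table, while unknown node types still fail with an error.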

1 answer

solution1
1 ACCEPTED 2013-10-09 11:48:48
