Pattern recognition of SQL statements

Question

I have a text pattern matching problem that I could use some direction with. Not being very familiar with pattern recognition overall, I don't know if this is one of those "oh, just use blah-blah algorithm", or this is a really hard pattern problem.

The general statement of what I want to do is identify similarities between a series of SQL statements, in order to allow me to refactor those statements into a smaller number of stored procedures or other dynamically-generated SQL snippets. For example,

SELECT MIN(foo) FROM bar WHERE baz > 123;
SELECT MIN(footer) FROM bar;
SELECT MIN(foo), baz FROM bar;

are all kind of the same, but I would want to recognize that the value inside the MIN() should be a replaceable value, that I may have another column in the SELECT list, or have an optional WHERE clause. Note that this example is highly cooked up, but I hope it allows you to see what I am after.

In terms of scope, I would have a set of thousands of SQL statements that I would hope to reduce to dozens (?) of generic statements. In research so far, I have come across w-shingles, and n-grams, and have discarded approaches like "bag of words" because ordering is important. Taking this out of the SQL realm, another way of stating this problem might be "given a series of text statements, what is the smallest set of text fragments that can be used to reassemble those statements?"

Answer 1

What you really want is to find code clones across the code base.

There's a lot of ways to do that, but most of them seem to ignore the structure that the (SQL) language brings. That structure makes it "easier" to find code elements that make conceptual sense, as opposed to say N-grams (yes, "FROM x WHERE" is common but is an awkward chunk of SQL).

My abstract syntax tree (AST) based clone detection scheme parses source text to ASTs, and then finds shared trees that can be parameterized in a way that produces sensible generalizations by using the language grammar as a guide. See my technical paper Clone Detection Using Abstract Syntax Trees .

With respect to OP's example:

It will recognize that the value inside the MIN() should be a replaceable value
that the SELECT singleton column could be extended to a list
and that the WHERE clause is optional

It won't attempt to make those proposals, unless it find two candidate clones that vary in way these generalizations explain. It gets the generalizations basically by extracting them from the (SQL) grammar. OP's examples have exactly enough variation to force those generalizations.

A survey of clone detection techniques ( Comparison and Evaluation of Code Clone Detection Techniques and Tools: A Qualitative Approach ) rated this approach at the top of some 30 different clone detection methods; see table 14.

Answer 2

Question is a bit too broad, but I would suggest to give a shot the following approach:

This sounds like a document clustering problem, where you have a set of pieces of text (SQL statements) and you want to cluster them together to find if some of the statements are close to each other. Now, the trick here is in the distance measure between text statements. I would try something like edit distance

So in general the following approach could work:

Do some preprocessing of the sql statements you have. Tokenization, removing some words from statements etc. Just be careful here - you are not analysing just some natural language text, its an SQL statements so you will need some clever approach.
After that, try to write a function which would count distance between 2 sql queries. The edit distance should work for you.
Finally, try to run document clustering on all your SQL queries, using edit distance as a distance measure for clustering algorithm

Hope that helps.

Pattern recognition of SQL statements

Question

2 answers

solution1
3 ACCPTED 2016-10-18 00:42:55

solution2
1 2016-10-17 23:27:21

Pattern recognition of SQL statements

Question

2 answers

solution1 3 ACCPTED 2016-10-18 00:42:55

solution2 1 2016-10-17 23:27:21

solution1
3 ACCPTED 2016-10-18 00:42:55

solution2
1 2016-10-17 23:27:21