简体   繁体   中英

Regex to find variables and ignore methods

I'm trying to write a regex that finds all variables (and only variables, ignoring methods completely) in a given piece of JavaScript code. The actual code (the one which executes regex) is written in Java.

For now, I've got something like this:

Matcher matcher=Pattern.compile(".*?([a-z]+\\w*?).*?").matcher(string);
while(matcher.find()) {
    System.out.println(matcher.group(1));
}

So, when value of "string" is variable*func()*20

printout is:

variable
func

Which is not what I want. The simple negation of ( won't do, because it makes regex catch unnecessary characters or cuts them off, but still functions are captured. For now, I have the following code:

Matcher matcher=Pattern.compile(".*?(([a-z]+\\w*)(\\(?)).*?").matcher(formula);
while(matcher.find()) {
    if(matcher.group(3).isEmpty()) {
        System.out.println(matcher.group(2));
    }
}

It works, the printout is correct, but I don't like the additional check. Any ideas? Please?

EDIT (2011-04-12):

Thank you for all answers. There were questions, why would I need something like that. And you are right, in case of bigger, more complicated scripts, the only sane solution would be parsing them. In my case, however, this would be excessive. The scraps of JS I'm working on are intented to be simple formulas, something like (a+b)/2 . No comments, string literals, arrays, etc. Only variables and (probably) some built-in functions. I need variables list to check if they can be initalized and this point (and initialized at all). I realize that all of it can be done manually with RPN as well (which would be safer), but these formulas are going to be wrapped with bigger script and evaluated in web browser, so it's more convenient this way.

This may be a bit dirty, but it's assumed that whoever is writing these formulas (probably me, for most of the time), knows what is doing and is able to check if they are working correctly.

If anyone finds this question, wanting to do something similar, should now the risks/difficulties. I do, at least I hope so ;)

It's pretty well established that regex cannot be reliably used to parse structured input. See here for the famous response: RegEx match open tags except XHTML self-contained tags

As any given sequence of characters may or may not change meaning depending on previous or subsequent sequences of characters, you cannot reliably identify a syntactic element without both lexing and parsing the input text. Regex can be used for the former (breaking an input stream into tokens), but cannot be used reliably for the latter (assigning meaning to tokens depending on their position in the stream).

If you are rethinking using regex and wondering what else you could do, you could consider using an AST instead to access your source programatically. This answer shows you could use the Eclipse Java AST to build a syntax tree for Java source. I guess you could do similar for Javascript.

A regex won't cut in this case because Java isn't regular. Your best best is to get a parser that understands Java syntax and build onto that. Luckily, ANTLR has a Java 1.6 grammar (and 1.5 grammar ).

For your rather limited use case you could probably easily extend the variable assignment rules and get the info you need. It's a bit of a learning curve but this will probably be your best best for a quick and accurate solution.

Taking all the sound advice about how regex is not the best tool for the job into consideration is important. But you might get away with a quick and dirty regex if your rule is simple enough (and you are aware of the limitations of that rule):

Pattern regex = Pattern.compile(
    "\\b     # word boundary\n" +
    "[A-Za-z]# 1 ASCII letter\n" +
    "\\w*    # 0+ alnums\n" +
    "\\b     # word boundary\n" +
    "(?!     # Lookahead assertion: Make sure there is no...\n" +
    " \\s*   # optional whitespace\n" +
    " \\(    # opening parenthesis\n" +
    ")       # ...at this position in the string", 
    Pattern.COMMENTS);

This matches an identifier as long as it's not followed by a parenthesis. Of course, now you need group(0) instead of group(1) . And of course this matches lots of other stuff (inside strings, comments, etc.)...

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM