
Why is grep not working with my regular expression?

I have a regular expression to find functions in files.

The expression works perfectly in PHP.

If I try to run the same regex with grep from the console, I get an error:

grep -rP "(_t\s*\(\s*([\'\"])(\d+)\2\s*,\s*([\'\"])(.*?)(?<!\\)\4\s*(?(?=,)[^\)]*\s*\)|\)))" application scripts library public data | sort -n | uniq

grep: unrecognized character after (?<

It looks like grep can't handle this part of the regex, (?<!\\), which is important for me.

Can anyone advise how to modify the regex so that grep works with it?

EDIT: String: _t('123', 'pcs.', '', $userLang) . $data['ticker'] . ' (' . $data['security_name'] . ')

I need to find:

  1. the index in the function call ('123')

  2. the text in the function call ('pcs.')

  3. the function call itself: _t('123', 'pcs.', '', $userLang)

Doing what I said in the comments solves your problem (using the data from the link):

$ cat file
_t('123', 'шт.', '', $userLang)  . $data['ticker'] . ' (' . $data['security_name'] . ')
$ grep -P '(_t\s*\(\s*(['"'"'"])(\d+)\2\s*,\s*(['"'"'"])(.*?)(?<!\\)\4\s*(?(?=,)[^\)]*\s*\)|\)))' file
_t('123', 'шт.', '', $userLang)  . $data['ticker'] . ' (' . $data['security_name'] . ')

The trick here is to use single quotes around the whole regex; then, whenever you need a literal single quote, write '"'"', which means "close the original string, add a single quote within double quotes, then open a new single-quoted string". An alternative, as proposed by glglgl, is '\'', i.e. close the original string, add an escaped ' and open a new string.
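To check that the two quoting tricks really yield the same text, here is a minimal sketch that builds only the character class from the regex. The variable names a and b are just for illustration.

```shell
# Both spellings must produce the character class (['"]) used in the regex.
a='(['"'"'"])'    # close ', add "'" (a double-quoted single quote), reopen '
b='(['\''"])'     # close ', add \' (an escaped single quote), reopen '

printf '%s\n' "$a" "$b"
```

Both lines print (['"]), so either form can be used inside the grep command.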

Using single quotes prevents bash from interpreting the ! as a history expansion. As gniourf_gniourf mentions above, the other option would be to disable that behaviour, using set +o history (set +H turns off only the ! expansion and leaves the rest of the history mechanism alone).
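Single quotes also keep the backslashes intact, which matters for the (?<!\\) part. A minimal sketch, assuming the lines run as a script (where history expansion is off anyway); the variable names dq and sq are only for illustration:

```shell
dq="(?<!\\)"    # inside double quotes the shell collapses \\ to a single \
sq='(?<!\\)'    # inside single quotes every character is literal

printf 'double-quoted: %s\n' "$dq"
printf 'single-quoted: %s\n' "$sq"
```

The double-quoted version reaches grep as (?<!\), which is no longer a lookbehind for a backslash, while the single-quoted version arrives unchanged.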

Just as a suggestion, if you're looking to capture separate parts of the regex (and you're already using PCRE mode in grep), you could use Perl instead:

$ perl -lne '/(_t\s*\(\s*(['\''"])(\d+)\2\s*,\s*(['\''"])(.*?)(?<!\\)\4\s*(?(?=,)[^\)]*\s*\)|\)))/ && print "group 1: $1\ngroup 3: $3\ngroup 5: $5"' file
group 1: _t('123', 'шт.', '', $userLang)
group 3: 123
group 5: шт.

I strongly recommend using the tokenizer extension to parse PHP files. Parsing a programming language requires a stateful parser; a single regex is stateless and therefore cannot provide this.

Here is an example of how to extract function names from a PHP source file; tracking function calls is possible as well:

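$source = file_get_contents('some.php');

// token_get_all() returns arrays of [id, text, line] for most tokens
// and plain strings for single-character tokens such as '(' or ';'.
$tokens = token_get_all($source);
$count = count($tokens);
for ($i = 0; $i < $count; $i++) {
    $token = $tokens[$i];
    if (is_array($token) && $token[0] === T_FUNCTION) {
        // Skip the whitespace between the keyword 'function'
        // and the function's name.
        $i += 2;
        // Closures have no name, so the token found here is not a
        // T_STRING; print only real function names.
        if (isset($tokens[$i]) && is_array($tokens[$i]) && $tokens[$i][0] === T_STRING) {
            echo $tokens[$i][1] . PHP_EOL;
        }
    }
}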

In the comments you said that you also want to parse HTML and JS files. I recommend a DOM/JS parser for those.
