Elisp mechanism for converting PCRE regexps to emacs regexps

I admit a significant bias toward liking PCRE regexps much better than emacs regexps, if for no other reason than that when I type a '(' I pretty much always want a grouping operator. And, of course, \w and similar are SO much more convenient than the other equivalents.

But it would be crazy to expect to change the internals of emacs, of course. It should be possible to convert from a PCRE expression to an emacs expression, I'd think, and do all the needed conversions so I can write:

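(defun my-super-regexp-function ...
   ;; a regexp search needs `re-search-forward', and the backslash
   ;; in \d has to be doubled inside an elisp string
   (re-search-forward (pcre-convert "__\\w: \\d+")))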

(or similar).

Does anyone know of an elisp library that can do this?


Edit: Selecting a response from the answers below...

Wow, I love coming back from 4 days of vacation to find a slew of interesting answers to sort through! I love the work that went into the solutions of both types.

In the end, it looks like both the exec-a-script and straight elisp versions of the solutions would work, but from a pure speed and "correctness" standpoint the elisp version is certainly the one that people would prefer (myself included).

https://github.com/joddie/pcre2el is the up-to-date version of this answer.

pcre2el or rxt (RegeXp Translator or RegeXp Tools) is a utility for working with regular expressions in Emacs, based on a recursive-descent parser for regexp syntax. In addition to converting (a subset of) PCRE syntax into its Emacs equivalent, it can do the following:

  • convert Emacs syntax to PCRE
  • convert either syntax to rx, an S-expression based regexp syntax
  • untangle complex regexps by showing the parse tree in rx form and highlighting the corresponding chunks of code
  • show the complete list of strings (productions) matching a regexp, provided the list is finite
  • provide live font-locking of regexp syntax (so far only for Elisp buffers; other modes are on the TODO list)
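
For anyone who wants to try the library from Lisp code, here is a minimal sketch. It assumes the pcre2el package is installed, and the function names `rxt-pcre-to-elisp` and `rxt-elisp-to-pcre` are the ones I recall from its README, so check them against the version you have. The `require` guard makes the snippet a no-op when the package is missing.

```elisp
;; Assumption: pcre2el is installed and provides `rxt-pcre-to-elisp'
;; and `rxt-elisp-to-pcre'.  If it cannot be loaded, nothing happens.
(when (require 'pcre2el nil t)
  (list
   ;; PCRE syntax in, Emacs regexp string out.
   (rxt-pcre-to-elisp "(foo|bar)\\d+")
   ;; The opposite direction: Emacs syntax in, PCRE string out.
   (rxt-elisp-to-pcre "\\(foo\\|bar\\)[0-9]+")))
```

The exact strings returned depend on the library version, so none are quoted here.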

The text of the original answer follows...


Here's a quick and ugly Emacs lisp solution (EDIT: now located more permanently here). It's based mostly on the description in the pcrepattern man page, and works token by token, converting only the following constructions:

  • parenthesis grouping ( .. )
  • alternation |
  • numerical repeats {M,N}
  • string quoting \Q .. \E
  • simple character escapes: \a, \c, \e, \f, \n, \r, \t, \x, and \ + octal digits
  • character classes: \d, \D, \h, \H, \s, \S, \v, \V
  • \w and \W left as they are (using Emacs' own idea of word and non-word characters)
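
To make the first few items concrete, here is a small hand-written sketch of what the two syntaxes look like side by side. It does not use the converter at all; it only checks, with built-in functions, that the hand-translated Emacs regexp behaves the way the PCRE original would.

```elisp
;; PCRE:   (foo|bar){2}
;; Emacs:  \(foo\|bar\)\{2\}   (written with doubled backslashes in a string)
(let ((emacs-re "\\`\\(foo\\|bar\\)\\{2\\}\\'"))
  (list
   (string-match-p emacs-re "foobar")   ; => 0, two repetitions match
   (string-match-p emacs-re "foo")      ; => nil, only one repetition
   ;; In Emacs syntax a bare paren is a literal character.
   (string-match-p "(foo)" "(foo)")     ; => 0
   ;; \Q .. \E corresponds to quoting the enclosed text literally.
   (regexp-quote "a.b*")))              ; => "a\\.b\\*"
```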

It doesn't do anything with more complicated PCRE assertions, but it does try to convert escapes inside character classes. In the case of character classes including something like \D, this is done by converting into a non-capturing group with alternation.
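
As an illustration of that last point, the PCRE class [a-f\D] (a letter from a to f, or any non-digit) has no single-bracket equivalent in Emacs, so it becomes a shy group with two alternatives. The sketch below writes that translated form out by hand and checks its behaviour with built-in functions only.

```elisp
;; PCRE:   [a-f\D]
;; Emacs:  \(?:[a-f]\|[^0-9]\)
(let ((re "\\`\\(?:[a-f]\\|[^0-9]\\)\\'"))
  (list
   (string-match-p re "c")    ; => 0, inside a-f
   (string-match-p re "x")    ; => 0, a non-digit
   (string-match-p re "7")))  ; => nil, a digit outside a-f
```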

It passes the tests I wrote for it, but there are certainly bugs, and the method of scanning token-by-token is probably slow. In other words, no warranty. But perhaps it will do enough of the simpler part of the job for some purposes. Interested parties are invited to improve it ;-)

(eval-when-compile (require 'cl))

(defvar pcre-horizontal-whitespace-chars
  (mapconcat 'char-to-string
             '(#x0009 #x0020 #x00A0 #x1680 #x180E #x2000 #x2001 #x2002 #x2003
                      #x2004 #x2005 #x2006 #x2007 #x2008 #x2009 #x200A #x202F
                      #x205F #x3000)
             ""))

(defvar pcre-vertical-whitespace-chars
  (mapconcat 'char-to-string
             '(#x000A #x000B #x000C #x000D #x0085 #x2028 #x2029) ""))

(defvar pcre-whitespace-chars
  (mapconcat 'char-to-string '(9 10 12 13 32) ""))

(defvar pcre-horizontal-whitespace
  (concat "[" pcre-horizontal-whitespace-chars "]"))

(defvar pcre-non-horizontal-whitespace
  (concat "[^" pcre-horizontal-whitespace-chars "]"))

(defvar pcre-vertical-whitespace
  (concat "[" pcre-vertical-whitespace-chars "]"))

(defvar pcre-non-vertical-whitespace
  (concat "[^" pcre-vertical-whitespace-chars "]"))

(defvar pcre-whitespace (concat "[" pcre-whitespace-chars "]"))

(defvar pcre-non-whitespace (concat "[^" pcre-whitespace-chars "]"))

(eval-when-compile
  (defmacro pcre-token-case (&rest cases)
    "Consume a token at point and evaluate corresponding forms.

CASES is a list of `cond'-like clauses, (REGEXP FORMS
...). Considering CASES in order, if the text at point matches
REGEXP then moves point over the matched string and returns the
value of FORMS. Returns `nil' if none of the CASES matches."
    (declare (debug (&rest (sexp &rest form))))
    `(cond
      ,@(mapcar
         (lambda (case)
           (let ((token (car case))
                 (action (cdr case)))
             `((looking-at ,token)
               (goto-char (match-end 0))
               ,@action)))
         cases)
      (t nil))))

(defun pcre-to-elisp (pcre)
  "Convert PCRE, a regexp in PCRE notation, into Elisp string form."
  (with-temp-buffer
    (insert pcre)
    (goto-char (point-min))
    (let ((capture-count 0) (accum '())
          (case-fold-search nil))
      (while (not (eobp))
        (let ((translated
               (or
                ;; Handle tokens that are treated the same in
                ;; character classes
                (pcre-re-or-class-token-to-elisp)   

                ;; Other tokens
                (pcre-token-case
                 ("|" "\\|")
                 ("(" (incf capture-count) "\\(")
                 (")" "\\)")
                 ("{" "\\{")
                 ("}" "\\}")

                 ;; Character class
                 ("\\[" (pcre-char-class-to-elisp))

                 ;; Backslash + digits => backreference or octal char?
                 ("\\\\\\([0-9]+\\)"
                  (let* ((digits (match-string 1))
                         (dec (string-to-number digits)))
                    ;; from "man pcrepattern": If the number is
                    ;; less than 10, or if there have been at
                    ;; least that many previous capturing left
                    ;; parentheses in the expression, the entire
                    ;; sequence is taken as a back reference.   
                    (cond ((< dec 10) (concat "\\" digits))
                          ((>= capture-count dec)
                           (error "backreference \\%s can't be used in Emacs regexps"
                                  digits))
                          (t
                           ;; from "man pcrepattern": if the
                           ;; decimal number is greater than 9 and
                           ;; there have not been that many
                           ;; capturing subpatterns, PCRE re-reads
                           ;; up to three octal digits following
                           ;; the backslash, and uses them to
                           ;; generate a data character. Any
                           ;; subsequent digits stand for
                           ;; themselves.
                           (goto-char (match-beginning 1))
                           (re-search-forward "[0-7]\\{0,3\\}")
                           (char-to-string (string-to-number (match-string 0) 8))))))

                 ;; Regexp quoting.
                 ("\\\\Q"
                  (let ((beginning (point)))
                    (search-forward "\\E")
                    (regexp-quote (buffer-substring beginning (match-beginning 0)))))

                 ;; Various character classes
                 ("\\\\d" "[0-9]")
                 ("\\\\D" "[^0-9]")
                 ("\\\\h" pcre-horizontal-whitespace)
                 ("\\\\H" pcre-non-horizontal-whitespace)
                 ("\\\\s" pcre-whitespace)
                 ("\\\\S" pcre-non-whitespace)
                 ("\\\\v" pcre-vertical-whitespace)
                 ("\\\\V" pcre-non-vertical-whitespace)

                 ;; Use Emacs' native notion of word characters
                 ("\\\\[Ww]" (match-string 0))

                 ;; Any other escaped character
                 ("\\\\\\(.\\)" (regexp-quote (match-string 1)))

                 ;; Any normal character
                 ("." (match-string 0))))))
          (push translated accum)))
      (apply 'concat (reverse accum)))))

(defun pcre-re-or-class-token-to-elisp ()
  "Consume the PCRE token at point and return its Elisp equivalent.

Handles only tokens which have the same meaning in character
classes as outside them."
  (pcre-token-case
   ("\\\\a" (char-to-string #x07))  ; bell
   ("\\\\c\\(.\\)"                  ; control character
    (char-to-string
     (- (string-to-char (upcase (match-string 1))) 64)))
   ("\\\\e" (char-to-string #x1b))  ; escape
   ("\\\\f" (char-to-string #x0c))  ; formfeed
   ("\\\\n" (char-to-string #x0a))  ; linefeed
   ("\\\\r" (char-to-string #x0d))  ; carriage return
   ("\\\\t" (char-to-string #x09))  ; tab
   ("\\\\x\\([A-Za-z0-9]\\{2\\}\\)"
    (char-to-string (string-to-number (match-string 1) 16)))
   ("\\\\x{\\([A-Za-z0-9]*\\)}"
    (char-to-string (string-to-number (match-string 1) 16)))))

(defun pcre-char-class-to-elisp ()
  "Consume the remaining PCRE character class at point and return its Elisp equivalent.

Point should be after the opening \"[\" when this is called, and
will be just after the closing \"]\" when it returns."
  (let ((accum '("["))
        (pcre-char-class-alternatives '())
        (negated nil))
    (when (looking-at "\\^")
      (setq negated t)
      (push "^" accum)
      (forward-char))
    (when (looking-at "\\]") (push "]" accum) (forward-char))

    (while (not (looking-at "\\]"))
      (let ((translated
             (or
              (pcre-re-or-class-token-to-elisp)
              (pcre-token-case              
               ;; Backslash + digits => always an octal char
               ("\\\\\\([0-7]\\{1,3\\}\\)"    
                (char-to-string (string-to-number (match-string 1) 8)))

               ;; Various character classes. To implement negative char classes,
               ;; we cons them onto the list `pcre-char-class-alternatives' and
               ;; transform the char class into a shy group with alternation
               ("\\\\d" "0-9")
               ("\\\\D" (push (if negated "[0-9]" "[^0-9]")
                              pcre-char-class-alternatives) "")
               ("\\\\h" pcre-horizontal-whitespace-chars)
               ("\\\\H" (push (if negated
                                  pcre-horizontal-whitespace
                                pcre-non-horizontal-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\s" pcre-whitespace-chars)
               ("\\\\S" (push (if negated
                                  pcre-whitespace
                                pcre-non-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\v" pcre-vertical-whitespace-chars)
               ("\\\\V" (push (if negated
                                  pcre-vertical-whitespace
                                pcre-non-vertical-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\w" (push (if negated "\\W" "\\w") 
                              pcre-char-class-alternatives) "")
               ("\\\\W" (push (if negated "\\w" "\\W") 
                              pcre-char-class-alternatives) "")

               ;; Leave POSIX syntax unchanged
               ("\\[:[a-z]*:\\]" (match-string 0))

               ;; Ignore other escapes
               ("\\\\\\(.\\)" (match-string 0))

               ;; Copy everything else
               ("." (match-string 0))))))
        (push translated accum)))
    (push "]" accum)
    (forward-char)
    (let ((class
           (apply 'concat (reverse accum))))
      (when (or (equal class "[]")
                (equal class "[^]"))
        (setq class ""))
      (if (not pcre-char-class-alternatives)
          class
        (concat "\\(?:"
                class "\\|"
                (mapconcat 'identity
                           pcre-char-class-alternatives
                           "\\|")
                "\\)")))))

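A short usage sketch for the code above. The helper name `my-pcre-to-elisp-demo` is made up for illustration, and it assumes the definitions above have already been evaluated. The results in the comments come from tracing the token cases by hand, so treat them as expected rather than guaranteed.

```elisp
(defun my-pcre-to-elisp-demo ()
  "Return sample translations produced by `pcre-to-elisp'.
Hypothetical helper; requires `pcre-to-elisp' to be defined."
  (mapcar #'pcre-to-elisp
          '("__\\w: \\d+"      ; => "__\\w: [0-9]+"
            "(a|b){2,3}"       ; => "\\(a\\|b\\)\\{2,3\\}"
            "\\Qa.b\\E")))     ; => "a\\.b"
```

Calling (my-pcre-to-elisp-demo) returns the three translated strings as a list.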
I made a few minor modifications to a perl script I found on perlmonks (to take values from the command line) and saved it as re_pl2el.pl (given below). Then the following does a decent job of converting PCRE to elisp regexps, at least for the non-exotic cases that I tested.

(defun pcre-to-elre (regex)
  (interactive "MPCRE expression: ")
  (shell-command-to-string (concat "re_pl2el.pl -i -n "
                                   (shell-quote-argument regex))))

(pcre-to-elre "__\\w: \\d+") ;-> "__[[:word:]]: [[:digit:]]+"

It doesn't handle a few "corner" cases like perl's shy {N,M}? constructs, and of course not code execution etc., but it might serve your needs or be a good starting place for such. Since you like PCRE I presume you know enough perl to fix any cases you use often. If not, let me know and we can probably fix them.

I would be happier with a script that parsed the regex into an AST and then spit it back out in elisp format (since then it could spit it out in rx format too), but I couldn't find anything doing that and it seemed like a lot of work when I should be working on my thesis. :-) I find it hard to believe that no one has done it though.

Below is my "improved" version of re_pl2el.pl. -i means don't double escape for strings, and -n means don't print a final newline.

#! /usr/bin/perl
#
# File: re_pl2el.pl
# Modified from http://perlmonks.org/?node_id=796020
#
# Description:
#
use strict;
use warnings;

# version 0.4


# TODO
# * wrap converter to function
# * testsuite

#--- flags
my $flag_interactive; # true => no extra escaping of backslashes
if ( int(@ARGV) >= 1 and $ARGV[0] eq '-i' ) {
    $flag_interactive = 1;
    shift @ARGV;
}

if ( int(@ARGV) >= 1 and $ARGV[0] eq '-n' ) {
    shift @ARGV;
} else {
    $\="\n";
}

if ( int(@ARGV) < 1 ) {
    print "usage: $0 [-i] [-n] REGEX";
    exit;
}

my $RE='\w*(a|b|c)\d\(';
$RE='\d{2,3}';
$RE='"(.*?)"';
$RE="\0".'\"\t(.*?)"';
$RE=$ARGV[0];

# print "Perlcode:\t $RE";

#--- encode all \0 chars as escape sequence
$RE=~s#\0#\\0#g;

#--- substitute pairs of backslashes with \0
$RE=~s#\\\\#\0#g;

#--- hide escape sequences of \t,\n,... with
#    corresponding ascii code
my %ascii=(
       t =>"\t",
       n=> "\n"
      );
my $kascii=join "|",keys %ascii;

$RE=~s#\\($kascii)#$ascii{$1}#g;


#---  normalize needless escaping
# e.g.  from /\"/ to /"/, since it's no difference in perl
# but might confuse elisp

$RE=~s#\\"#"#g;

#--- toggle escaping of 'backslash constructs'
my $bsc='(){}|';
$RE=~s#[$bsc]#\\$&#g;  # escape them once
$RE=~s#\\\\##g;        # and erase double-escaping



#--- replace character classes
my %charclass=(
        w => 'word' ,   # TODO: emacs22 already knows \w ???
        d => 'digit',
        s => 'space'
       );

my $kc=join "|",keys %charclass;
$RE=~s#\\($kc)#[[:$charclass{$1}:]]#g;



#--- unhide pairs of backslashes
$RE=~s#\0#\\\\#g;

#--- escaping for elisp string
unless ($flag_interactive){
  $RE=~s#\\#\\\\#g; # ... backslashes
  $RE=~s#"#\\"#g;   # ... quotes
}

#--- unhide escape sequences of \t,\n,...
my %rascii= reverse %ascii;
my $vascii=join "|",keys %rascii;
$RE=~s#($vascii)#\\$rascii{$1}#g;

# print "Elispcode:\t $RE";
print "$RE";
#TODO whats the elisp syntax for \0 ???

The closest previous work on this has been extensions to M-x re-builder, see

http://www.emacswiki.org/emacs/ReBuilder

or the work of Ye Wenbin on PDE.

http://cpansearch.perl.org/src/YEWENBIN/Emacs-PDE-0.2.16/lisp/doc/pde.html

Possibly relevant is visual-regexp-steroids, which extends query-replace with a live preview and lets you use different regexp backends, including PCRE.
