Elisp mechanism for converting PCRE regexps to Emacs regexps
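I admit a significant bias toward liking PCRE regexps much better than Emacs's, if for no other reason than that when I type a '(' I pretty much always want a grouping operator. And, of course, \w and similar are so much more convenient than the other equivalents.
But it would be crazy to expect to change the internals of Emacs, of course. Still, I'd think it should be possible to convert from a PCRE expression to an Emacs expression, doing all the needed conversions, so that I can write: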
(defun my-super-regexp-function ...
  (re-search-forward (pcre-convert "__\\w: \\d+")))
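(or similar).
Does anyone know of an elisp library that can do this?
Edit: Selecting a response from the answers below...
Wow, I love coming back from 4 days of vacation to find a slew of interesting answers to sort through! I love the work that went into both kinds of solution.
In the end, it looks like both the exec-a-script and the straight elisp versions of the solution would work, but in terms of pure speed and "correctness" the elisp version is certainly the one that people would prefer (myself included).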
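https://github.com/joddie/pcre2el is the up-to-date version of this answer.
pcre2el or rxt (RegeXp Translator or RegeXp Tools) is a utility for working with regular expressions in Emacs, based on a recursive-descent parser for regexp syntax. In addition to converting (a subset of) PCRE syntax into its Emacs equivalent, it can do the following:
- convert Emacs syntax to PCRE
- convert either syntax to rx, an S-expression based regexp syntax
- untangle complex regexps by showing the parse tree in rx form and highlighting the corresponding chunks of code
- show the complete list of strings (productions) matching a regexp, provided the list is finite
- provide live font-locking of regexp syntax (so far only for Elisp buffers; other modes are on the TODO list)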
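As a rough illustration of how the package might be used for the search in the question, here is a minimal sketch. It assumes pcre2el is installed and relies on its rxt-pcre-to-elisp function; the helper name my-pcre-re-search-forward is made up for this example.

```elisp
;; Soft require: returns nil instead of signalling if pcre2el is missing.
(require 'pcre2el nil t)

(defun my-pcre-re-search-forward (pcre &optional bound noerror count)
  "Search forward from point for PCRE, a regexp in PCRE syntax.
Hypothetical helper: translates PCRE with `rxt-pcre-to-elisp' and
passes the result, with BOUND, NOERROR and COUNT, to `re-search-forward'."
  (unless (fboundp 'rxt-pcre-to-elisp)
    (error "pcre2el is not installed"))
  (re-search-forward (rxt-pcre-to-elisp pcre) bound noerror count))
```

In a buffer, (my-pcre-re-search-forward "__\\w: \\d+" nil t) would then behave like the my-super-regexp-function wished for in the question. The backslashes are still doubled because the pattern is written as a Lisp string.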
The text of the original answer follows...
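Here's a quick and ugly Emacs Lisp solution (edit: it has since been moved to a more permanent location). It is based mostly on the description in the pcrepattern man page, and works token by token, converting only the following constructions:
- ( .. )
- |
- {M,N}
- \Q .. \E
- \a, \c, \e, \f, \n, \r, \t, \x, and \ followed by octal digits
- \d, \D, \h, \H, \s, \S, \v, \V
- \w and \W, which are left as they are (using Emacs' own idea of word and non-word characters)
It doesn't do anything with the more complicated PCRE assertions, but it does try to convert escapes inside character classes. For a character class that includes something like \D, this is done by converting the class into a non-capturing group with alternation.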
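To make that last point concrete, here is a small hand-written sketch of the idea. It is not output from the code in this answer. The PCRE class [0-3\D] means "a digit from 0 to 3, or any non-digit". Emacs has no \D inside brackets, so the class is split into a shy group with two alternatives. The names my-class-with-non-digit and my-class-matches-p are made up for the example.

```elisp
(defconst my-class-with-non-digit "\\(?:[0-3]\\|[^0-9]\\)"
  "Hand translation of the PCRE character class [0-3\\D].
The positive part stays a bracket expression; the negated escape
becomes a separate alternative inside a non-capturing group.")

(defun my-class-matches-p (char-string)
  "Return t if CHAR-STRING, a one-character string, matches the class."
  (let ((case-fold-search nil))
    (and (string-match-p (concat "\\`" my-class-with-non-digit "\\'")
                         char-string)
         t)))

(my-class-matches-p "2")   ; digit in the range 0-3
(my-class-matches-p "x")   ; not a digit
(my-class-matches-p "7")   ; digit outside 0-3, so no match
```

The alternation is a union of the two parts, which is what a non-negated class needs. A negated class such as [^a\D] calls for an intersection instead, so it would need more care than this sketch shows.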
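It passes the tests I wrote for it, but there are certainly bugs, and scanning token by token is probably slow. In other words, no warranty. But perhaps it will do enough of the simpler part of the job for some purposes. Interested parties are invited to improve it ;-)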
(eval-when-compile (require 'cl-lib))   ; for `cl-incf'

(defvar pcre-horizontal-whitespace-chars
  (mapconcat 'char-to-string
             '(#x0009 #x0020 #x00A0 #x1680 #x180E #x2000 #x2001 #x2002 #x2003
               #x2004 #x2005 #x2006 #x2007 #x2008 #x2009 #x200A #x202F
               #x205F #x3000)
             ""))

(defvar pcre-vertical-whitespace-chars
  (mapconcat 'char-to-string
             '(#x000A #x000B #x000C #x000D #x0085 #x2028 #x2029) ""))

(defvar pcre-whitespace-chars
  (mapconcat 'char-to-string '(9 10 12 13 32) ""))

(defvar pcre-horizontal-whitespace
  (concat "[" pcre-horizontal-whitespace-chars "]"))

(defvar pcre-non-horizontal-whitespace
  (concat "[^" pcre-horizontal-whitespace-chars "]"))

(defvar pcre-vertical-whitespace
  (concat "[" pcre-vertical-whitespace-chars "]"))

(defvar pcre-non-vertical-whitespace
  (concat "[^" pcre-vertical-whitespace-chars "]"))

(defvar pcre-whitespace (concat "[" pcre-whitespace-chars "]"))

(defvar pcre-non-whitespace (concat "[^" pcre-whitespace-chars "]"))

(eval-when-compile
  (defmacro pcre-token-case (&rest cases)
    "Consume a token at point and evaluate corresponding forms.
CASES is a list of `cond'-like clauses, (REGEXP FORMS
...). Considering CASES in order, if the text at point matches
REGEXP then moves point over the matched string and returns the
value of FORMS. Returns `nil' if none of the CASES matches."
    (declare (debug (&rest (sexp &rest form))))
    `(cond
      ,@(mapcar
         (lambda (case)
           (let ((token (car case))
                 (action (cdr case)))
             `((looking-at ,token)
               (goto-char (match-end 0))
               ,@action)))
         cases)
      (t nil))))

(defun pcre-to-elisp (pcre)
  "Convert PCRE, a regexp in PCRE notation, into Elisp string form."
  (with-temp-buffer
    (insert pcre)
    (goto-char (point-min))
    (let ((capture-count 0) (accum '())
          (case-fold-search nil))
      (while (not (eobp))
        (let ((translated
               (or
                ;; Handle tokens that are treated the same in
                ;; character classes
                (pcre-re-or-class-token-to-elisp)
                ;; Other tokens
                (pcre-token-case
                 ("|" "\\|")
                 ("(" (cl-incf capture-count) "\\(")
                 (")" "\\)")
                 ("{" "\\{")
                 ("}" "\\}")
                 ;; Character class
                 ("\\[" (pcre-char-class-to-elisp))
                 ;; Backslash + digits => backreference or octal char?
                 ("\\\\\\([0-9]+\\)"
                  (let* ((digits (match-string 1))
                         (dec (string-to-number digits)))
                    ;; from "man pcrepattern": If the number is
                    ;; less than 10, or if there have been at
                    ;; least that many previous capturing left
                    ;; parentheses in the expression, the entire
                    ;; sequence is taken as a back reference.
                    (cond ((< dec 10) (concat "\\" digits))
                          ((>= capture-count dec)
                           (error "backreference \\%s can't be used in Emacs regexps"
                                  digits))
                          (t
                           ;; from "man pcrepattern": if the
                           ;; decimal number is greater than 9 and
                           ;; there have not been that many
                           ;; capturing subpatterns, PCRE re-reads
                           ;; up to three octal digits following
                           ;; the backslash, and uses them to
                           ;; generate a data character. Any
                           ;; subsequent digits stand for
                           ;; themselves.
                           (goto-char (match-beginning 1))
                           (re-search-forward "[0-7]\\{0,3\\}")
                           (char-to-string (string-to-number (match-string 0) 8))))))
                 ;; Regexp quoting.
                 ("\\\\Q"
                  (let ((beginning (point)))
                    (search-forward "\\E")
                    (regexp-quote (buffer-substring beginning (match-beginning 0)))))
                 ;; Various character classes
                 ("\\\\d" "[0-9]")
                 ("\\\\D" "[^0-9]")
                 ("\\\\h" pcre-horizontal-whitespace)
                 ("\\\\H" pcre-non-horizontal-whitespace)
                 ("\\\\s" pcre-whitespace)
                 ("\\\\S" pcre-non-whitespace)
                 ("\\\\v" pcre-vertical-whitespace)
                 ("\\\\V" pcre-non-vertical-whitespace)
                 ;; Use Emacs' native notion of word characters
                 ("\\\\[Ww]" (match-string 0))
                 ;; Any other escaped character
                 ("\\\\\\(.\\)" (regexp-quote (match-string 1)))
                 ;; Any normal character
                 ("." (match-string 0))))))
          (push translated accum)))
      (apply 'concat (reverse accum)))))

(defun pcre-re-or-class-token-to-elisp ()
  "Consume the PCRE token at point and return its Elisp equivalent.
Handles only tokens which have the same meaning in character
classes as outside them."
  (pcre-token-case
   ("\\\\a" (char-to-string #x07))      ; bell
   ("\\\\c\\(.\\)"                      ; control character
    ;; pcrepattern: the character is upcased, then bit 6 (hex 40)
    ;; is inverted
    (char-to-string
     (logxor (string-to-char (upcase (match-string 1))) 64)))
   ("\\\\e" (char-to-string #x1b))      ; escape
   ("\\\\f" (char-to-string #x0c))      ; formfeed
   ("\\\\n" (char-to-string #x0a))      ; linefeed
   ("\\\\r" (char-to-string #x0d))      ; carriage return
   ("\\\\t" (char-to-string #x09))      ; tab
   ;; \xHH and \x{H...}: accept hexadecimal digits only
   ("\\\\x\\([[:xdigit:]]\\{2\\}\\)"
    (char-to-string (string-to-number (match-string 1) 16)))
   ("\\\\x{\\([[:xdigit:]]*\\)}"
    (char-to-string (string-to-number (match-string 1) 16)))))

(defun pcre-char-class-to-elisp ()
  "Consume the remaining PCRE character class at point and return its Elisp equivalent.
Point should be after the opening \"[\" when this is called, and
will be just after the closing \"]\" when it returns."
  (let ((accum '("["))
        (pcre-char-class-alternatives '())
        (negated nil))
    (when (looking-at "\\^")
      (setq negated t)
      (push "^" accum)
      (forward-char))
    (when (looking-at "\\]") (push "]" accum) (forward-char))
    (while (not (looking-at "\\]"))
      (let ((translated
             (or
              (pcre-re-or-class-token-to-elisp)
              (pcre-token-case
               ;; Backslash + digits => always an octal char
               ("\\\\\\([0-7]\\{1,3\\}\\)"
                (char-to-string (string-to-number (match-string 1) 8)))
               ;; Various character classes. To implement negative char classes,
               ;; we cons them onto the list `pcre-char-class-alternatives' and
               ;; transform the char class into a shy group with alternation
               ("\\\\d" "0-9")
               ("\\\\D" (push (if negated "[0-9]" "[^0-9]")
                              pcre-char-class-alternatives) "")
               ("\\\\h" pcre-horizontal-whitespace-chars)
               ("\\\\H" (push (if negated
                                  pcre-horizontal-whitespace
                                pcre-non-horizontal-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\s" pcre-whitespace-chars)
               ("\\\\S" (push (if negated
                                  pcre-whitespace
                                pcre-non-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\v" pcre-vertical-whitespace-chars)
               ("\\\\V" (push (if negated
                                  pcre-vertical-whitespace
                                pcre-non-vertical-whitespace)
                              pcre-char-class-alternatives) "")
               ("\\\\w" (push (if negated "\\W" "\\w")
                              pcre-char-class-alternatives) "")
               ("\\\\W" (push (if negated "\\w" "\\W")
                              pcre-char-class-alternatives) "")
               ;; Leave POSIX syntax unchanged
               ("\\[:[a-z]*:\\]" (match-string 0))
               ;; Ignore other escapes
               ("\\\\\\(.\\)" (match-string 0))
               ;; Copy everything else
               ("." (match-string 0))))))
        (push translated accum)))
    (push "]" accum)
    (forward-char)
    (let ((class
           (apply 'concat (reverse accum))))
      ;; An empty class means every member was moved into the
      ;; alternatives; drop it so that the group cannot match the
      ;; empty string.
      (when (or (equal class "[]")
                (equal class "[^]"))
        (setq class nil))
      (if (not pcre-char-class-alternatives)
          (or class "")
        (concat "\\(?:"
                (mapconcat 'identity
                           (if class
                               (cons class pcre-char-class-alternatives)
                             pcre-char-class-alternatives)
                           "\\|")
                "\\)")))))
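I made a few minor modifications to a perl script I found on perlmonks (to take values from the command line) and saved it as re_pl2el.pl (given below). With that in place, the following does a decent job of converting PCRE to elisp regexps, at least for the non-exotic cases that I tested.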
(defun pcre-to-elre (regex)
  (interactive "MPCRE expression: ")
  (shell-command-to-string (concat "re_pl2el.pl -i -n "
                                   (shell-quote-argument regex))))

(pcre-to-elre "__\\w: \\d+") ;-> "__[[:word:]]: [[:digit:]]+"
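It doesn't handle a few "corner" cases such as perl's shy {N,M}? constructs, and of course not code execution etc., but it might serve your needs or be a good starting point. Since you like PCRE, I presume you know enough perl to fix any cases you use often. If not, let me know and we can probably fix them.
I would be happier with a script that parsed the regex into an AST and then spat it back out in elisp format (since then it could spit it out in rx format too), but I couldn't find anything doing that, and it seemed like a lot of work when I should be working on my thesis. :-) I find it hard to believe that no one has done it, though.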
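For readers who haven't met rx: it is Emacs's built-in S-expression notation for regexps, so a translator that emitted rx forms would produce something along these lines. This is a hand-written sketch of the question's pattern __\w: \d+, not the output of any tool mentioned here. The names my-field-regexp and my-find-field are made up for the example.

```elisp
(require 'rx)

(defconst my-field-regexp
  (rx "__" word ": " (one-or-more digit))
  "Emacs regexp, built with `rx', for the PCRE pattern __\\w: \\d+.")

(defun my-find-field (string)
  "Return the first match of `my-field-regexp' in STRING, or nil."
  (let ((case-fold-search nil))
    (when (string-match my-field-regexp string)
      (match-string 0 string))))

;; With the standard syntax table this returns "__a: 42".
(my-find-field "id __a: 42 end")
```

Note that word here follows Emacs's notion of a word-constituent character, which depends on the current syntax table and is not necessarily the same set as PCRE's \w.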
Below is my "improved" version of re_pl2el.pl. -i means don't double-escape for strings, and -n means don't print a final newline.
#! /usr/bin/perl
#
# File: re_pl2el.pl
# Modified from http://perlmonks.org/?node_id=796020
#
# Description:
#
use strict;
use warnings;
# version 0.4
# TODO
# * wrap converter to function
# * testsuite
#--- flags
my $flag_interactive; # true => no extra escaping of backslashes
if ( int(@ARGV) >= 1 and $ARGV[0] eq '-i' ) {
    $flag_interactive = 1;
    shift @ARGV;
}
if ( int(@ARGV) >= 1 and $ARGV[0] eq '-n' ) {
    shift @ARGV;
} else {
    $\="\n";
}
if ( int(@ARGV) < 1 ) {
    print "usage: $0 [-i] [-n] REGEX";
    exit;
}
my $RE='\w*(a|b|c)\d\(';
$RE='\d{2,3}';
$RE='"(.*?)"';
$RE="\0".'\"\t(.*?)"';
$RE=$ARGV[0];
# print "Perlcode:\t $RE";
#--- encode all \0 chars as escape sequence
$RE=~s#\0#\\0#g;
#--- substitute pairs of backslashes with \0
$RE=~s#\\\\#\0#g;
#--- hide escape sequences of \t,\n,... with
# corresponding ascii code
my %ascii=(
    t => "\t",
    n => "\n"
);
my $kascii=join "|",keys %ascii;
$RE=~s#\\($kascii)#$ascii{$1}#g;
#--- normalize needless escaping
# e.g. from /\"/ to /"/, since it's no difference in perl
# but might confuse elisp
$RE=~s#\\"#"#g;
#--- toggle escaping of 'backslash constructs'
my $bsc='(){}|';
$RE=~s#[$bsc]#\\$&#g; # escape them once
$RE=~s#\\\\##g; # and erase double-escaping
#--- replace character classes
my %charclass=(
    w => 'word' , # TODO: emacs22 already knows \w ???
    d => 'digit',
    s => 'space'
);
my $kc=join "|",keys %charclass;
$RE=~s#\\($kc)#[[:$charclass{$1}:]]#g;
#--- unhide pairs of backslashes
$RE=~s#\0#\\\\#g;
#--- escaping for elisp string
unless ($flag_interactive){
    $RE=~s#\\#\\\\#g; # ... backslashes
    $RE=~s#"#\\"#g;   # ... quotes
}
#--- unhide escape sequences of \t,\n,...
my %rascii= reverse %ascii;
my $vascii=join "|",keys %rascii;
$RE=~s#($vascii)#\\$rascii{$1}#g;
# print "Elispcode:\t $RE";
print "$RE";
#TODO whats the elisp syntax for \0 ???
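The difference the -i flag makes can be seen from within Emacs itself. A regexp held in a Lisp string contains single backslashes, while the same regexp typed into Elisp source needs each backslash (and each double quote) escaped. The sketch below uses the built-in prin1-to-string to show that source form. The function name my-regexp-as-lisp-literal is made up. Unlike the script, prin1-to-string also adds the surrounding quotes.

```elisp
(defun my-regexp-as-lisp-literal (regexp)
  "Return REGEXP in Lisp read syntax, as it would appear in source code.
Every backslash and double quote in REGEXP is escaped, and the
result is wrapped in double quotes."
  (prin1-to-string regexp))

;; The regexp \(a\|b\) holds three backslashes; the source form
;; doubles each of them.
(my-regexp-as-lisp-literal "\\(a\\|b\\)")
```

Since pcre-to-elre above hands the script's output straight to Emacs as a string value, it passes -i so that no second layer of escaping is added.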
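The closest previous work on this has been the extensions to M-x re-builder, see
http://www.emacswiki.org/emacs/ReBuilder
or the work of Ye Wenbin on PDE:
http://cpansearch.perl.org/src/YEWENBIN/Emacs-PDE-0.2.16/lisp/doc/pde.html
Possibly relevant is visual-regexp-steroids, which extends query-replace with a live preview and lets you use different regexp backends, including PCRE.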