Modelling a regular expression parser with polymorphism
I'm writing a regular expression parser for school that builds a hierarchy of objects responsible for the matching. I decided to make it object-oriented because that makes it easier for me to picture an implementation of the grammar. These are the classes that make up the regular expressions. It's all in Java, but you should be able to follow along if you're proficient in any object-oriented language.
The only operators we're required to implement are Union (+), Kleene star (*), concatenation of expressions (ab, or maybe (a+b)c) and of course parentheses, as illustrated in the concatenation example. This is what I've implemented so far, and it works like a charm with a bit of overhead in the main.
The parent class, Regexp.java:
public abstract class Regexp {
//Print out the regular expression it's holding
//Used for debugging purposes
abstract public void print();
//Checks if the string matches the expression it's holding
abstract public Boolean match(String text);
//Adds a regular expression to be operated upon by the operators
abstract public void add(Regexp regexp);
/*
*To help the main with the overhead to help it decide which regexp will
*hold the other
*/
abstract public Boolean isEmpty();
}
The simplest regexp is Base.java, which holds a char and returns true if the string matches that char.
public class Base extends Regexp{
Character c; //Boxed so that an empty Base can hold null
public Base(char c){
this.c = c;
}
public Base(){
c = null;
}
@Override
public void print() {
System.out.println(c);
}
//If the string is the char, return true
@Override
public Boolean match(String text) {
if(text.length() > 1) return false;
return text.startsWith(""+c);
}
//Not utilized, since base is only contained and cannot contain
@Override
public void add(Regexp regexp) {
}
@Override
public Boolean isEmpty() {
return c == null;
}
}
A parenthesis, Paren.java, holds a regexp inside it. Nothing fancy here, but it illustrates how matching works.
public class Paren extends Regexp{
//Member variables: What it's holding and if it's holding something
private Regexp regexp;
Boolean empty;
//Parenthesis starts out empty
public Paren(){
empty = true;
}
//Unless you create it with something to hold
public Paren(Regexp regexp){
this.regexp = regexp;
empty = false;
}
//Print out what it's holding
@Override
public void print() {
regexp.print();
}
//Real simple; either what you're holding matches the string or it doesn't
@Override
public Boolean match(String text) {
return regexp.match(text);
}
//Pass something for it to hold, then it's not empty
@Override
public void add(Regexp regexp) {
this.regexp = regexp;
empty = false;
}
//Return if it's holding something
@Override
public Boolean isEmpty() {
return empty;
}
}
Union.java holds two regexps that can be matched. If either of them matches, the whole Union is a match.
public class Union extends Regexp{
//Members
Regexp lhs;
Regexp rhs;
//Indicating if there's room to push more stuff in
private Boolean lhsEmpty;
private Boolean rhsEmpty;
public Union(){
lhsEmpty = true;
rhsEmpty = true;
}
//Can start out with something on the left side
public Union(Regexp lhs){
this.lhs = lhs;
lhsEmpty = false;
rhsEmpty = true;
}
//Or with both members set
public Union(Regexp lhs, Regexp rhs) {
this.lhs = lhs;
this.rhs = rhs;
lhsEmpty = false;
rhsEmpty = false;
}
//Some stuff to help me see the unions format when I'm debugging
@Override
public void print() {
System.out.println("(");
lhs.print();
System.out.println("union");
rhs.print();
System.out.println(")");
}
//If the string matches the left side or right side, it's a match
@Override
public Boolean match(String text) {
if(lhs.match(text) || rhs.match(text)) return true;
return false;
}
/*
*If the left side is not set, add the member there first
*If not, and right side is empty, add the member there
*If they're both full, merge it with the right side
*(This is a consequence of left-to-right parsing)
*/
@Override
public void add(Regexp regexp) {
if(lhsEmpty){
lhs = regexp;
lhsEmpty = false;
}else if(rhsEmpty){
rhs = regexp;
rhsEmpty = false;
}else{
rhs.add(regexp);
}
}
//If it's not full, it's empty
@Override
public Boolean isEmpty() {
return (lhsEmpty || rhsEmpty);
}
}
A concatenation, Concat.java, is basically a list of regexps chained together. This one is complicated.
import java.util.ArrayList;
import java.util.List;
import java.util.ListIterator;
public class Concat extends Regexp{
/*
*The list of regexps is called product and the
*regexps inside called factors
*/
List<Regexp> product;
public Concat(){
product = new ArrayList<Regexp>();
}
public Concat(Regexp regexp){
product = new ArrayList<Regexp>();
pushRegexp(regexp);
}
public Concat(List<Regexp> product) {
this.product = product;
}
//Adding a new regexp pushes it into the list
public void pushRegexp(Regexp regexp){
product.add(regexp);
}
//Loops over and prints them
@Override
public void print() {
for(Regexp factor: product){
factor.print();
}
}
/*
*Builds up a substring approaching the input string.
*When it matches, it builds another substring from where it
*stopped. If the entire string has been pushed, it checks if
*there's an equal amount of matches and factors.
*/
@Override
public Boolean match(String text) {
ArrayList<Boolean> bools = new ArrayList<Boolean>();
int start = 0;
ListIterator<Regexp> itr = product.listIterator();
Regexp factor = itr.next();
for(int i = 0; i <= text.length(); i++){
String test = text.substring(start, i);
if(factor.match(test)){
start = i;
bools.add(true);
if(itr.hasNext())
factor = itr.next();
}
}
return (allTrue(bools) && (start == text.length()));
}
private Boolean allTrue(List<Boolean> bools){
return product.size() == bools.size();
}
@Override
public void add(Regexp regexp) {
pushRegexp(regexp);
}
@Override
public Boolean isEmpty() {
return product.isEmpty();
}
}
Again, I've gotten these to work to my satisfaction, together with my overhead, tokenization and all that good stuff. Now I want to introduce the Kleene star operation. It matches any number of occurrences in the text, even 0. So ba* would match b, ba, baa, baaa and so on, while (ba)* would match the empty string, ba, baba, bababa and so on. Does it even look possible to extend my Regexp to this, or do you see another way of solving it?
PS: There are getters, setters and all kinds of other support functions that I didn't write out; this is mainly so you can quickly get the point of how these classes work.
You seem to be trying to use a fallback (backtracking) algorithm to do the parsing. That can work, although it is easier to do with higher-order functions, but it is far from the best way to parse regular expressions (by which I mean the things which are mathematically regular expressions, as opposed to the panoply of parsing languages implemented by "regular expression" libraries in various languages).
It's not the best way because the parsing time is not linear in the size of the string to be matched; in fact, it can be exponential. But to understand that, it's important to understand why your current implementation has a problem.
Consider the fairly simple regular expression (ab+a)(bb+a). That can match exactly four strings: abbb, aba, abb, aa. All of those strings start with a, so your concatenation algorithm will match the first concatenand ((ab+a)) at position 1, and proceed to try the second concatenand (bb+a). That will successfully match abb and aa, but it will fail on aba and abbb.
Now, suppose you modified the concatenation function to select the longest matching substring rather than the shortest one. In that case, the first subexpression would match ab in three of the possible strings (all but aa), and the match would fail in the case of abb.
In short, when you are matching a concatenation R·S, you need to do something like this:
1. Find some initial string which matches R.
2. Check whether S matches the rest of the text.
3. If not, repeat with another initial string which matches R.
In the case of full regular expression matches, it doesn't matter in which order we list matches for R, but usually we're trying to find the longest substring which matches a regular expression, so it is convenient to enumerate the possible matches from longest to shortest.
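The retry loop described above for R·S can be sketched against a whole-string `match` interface like the one in the question. This is only a minimal illustration under assumptions: `WholeMatcher`, `ConcatSplit` and `matchConcat` are hypothetical names, the two factors are written as lambdas standing in for (ab+a) and (bb+a), and the prefix matched by R is tried from longest to shortest.

```java
// Hypothetical stand-in for the question's Regexp.match(String): true if the whole string matches.
interface WholeMatcher {
    boolean matches(String text);
}

// Hypothetical helper, not part of the classes in the question.
final class ConcatSplit {
    private ConcatSplit() {}

    // Matches text against R·S by trying every way to split it into "part for R" + "part for S".
    static boolean matchConcat(WholeMatcher r, WholeMatcher s, String text) {
        // Candidate prefixes for R are tried from longest to shortest.
        for (int split = text.length(); split >= 0; split--) {
            String head = text.substring(0, split);
            String tail = text.substring(split);
            // R must match the prefix and S must match everything that is left.
            if (r.matches(head) && s.matches(tail)) {
                return true;
            }
            // Otherwise loop again: that is the "repeat with another initial string" step.
        }
        return false;
    }

    public static void main(String[] args) {
        WholeMatcher r = t -> t.equals("ab") || t.equals("a"); // stands in for (ab+a)
        WholeMatcher s = t -> t.equals("bb") || t.equals("a"); // stands in for (bb+a)
        for (String text : new String[] {"abbb", "aba", "abb", "aa", "ab"}) {
            System.out.println(text + " -> " + matchConcat(r, s, text));
        }
    }
}
```

Because no split is ever committed to until S has accepted the remainder, all four strings of the example match, including abb, which a longest-only strategy rejects.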
Doing that means that we need to be able to restart a match after a downstream failure, to find the "next match". That's not terribly complicated, but it definitely complicates the interface, because all of the compound regular expression operators need to "pass through" the failure to their children in order to find the next alternative. That is, the operator R+S might first find something which matches R. If asked for the next possibility, it first has to ask R if there is another string which it could match, before moving on to S. (And that's passing over the question of how to get + to list the matches in order by length.)
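One way to get this "pass the failure through" behaviour without writing an explicit restart method is the higher-order-function style mentioned earlier: each node receives a callback representing the rest of the match, and a false return from that callback is the downstream failure. The sketch below assumes that design; `Rest`, `Node` and `Nodes` are hypothetical names, and it makes no attempt to order alternatives by length.

```java
// Hypothetical callback: given the position where a sub-match ended, does the rest of the match succeed?
interface Rest {
    boolean accept(int end);
}

// Hypothetical node type: tries each way of matching from pos and offers every candidate end to rest.
interface Node {
    boolean match(String text, int pos, Rest rest);
}

final class Nodes {
    private Nodes() {}

    // A single character: one candidate end, offered only if the character is present.
    static Node lit(char c) {
        return (text, pos, rest) ->
            pos < text.length() && text.charAt(pos) == c && rest.accept(pos + 1);
    }

    // R·S: the callback handed to R runs S. If S (or anything after it) fails,
    // control returns into R, which then offers its next candidate end.
    static Node concat(Node r, Node s) {
        return (text, pos, rest) -> r.match(text, pos, mid -> s.match(text, mid, rest));
    }

    // R+S: every alternative inside R is exhausted before S is tried.
    static Node union(Node r, Node s) {
        return (text, pos, rest) -> r.match(text, pos, rest) || s.match(text, pos, rest);
    }

    // A full match succeeds only if some candidate end is the end of the text.
    static boolean fullMatch(Node n, String text) {
        return n.match(text, 0, end -> end == text.length());
    }

    public static void main(String[] args) {
        Node first = union(concat(lit('a'), lit('b')), lit('a'));  // (ab+a)
        Node second = union(concat(lit('b'), lit('b')), lit('a')); // (bb+a)
        Node whole = concat(first, second);
        for (String text : new String[] {"abbb", "aba", "abb", "aa", "ab"}) {
            System.out.println(text + " -> " + fullMatch(whole, text));
        }
    }
}
```

For abb, the first union initially offers the end of ab, the second factor fails on the remaining b, and the union then falls through to its a branch, after which bb matches.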
With such an implementation, it's easy to see how to implement the Kleene star (R*), and it is also easy to see why it can take exponential time. One possible implementation:
1. Match as many R as possible.
2. If asked for another match, ask the last R for another match.
3. If that fails, remove the last R from the list, and ask what is now the last R for another match.
(This can be simplified with recursion: Match an R, then match an R*. For the next match, first try the next R*; failing that, try the next R and the first following R*; when all else fails, try the empty string.)
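The recursive formulation can be sketched in the style of the question's classes, with a whole-string `match` method. This is a minimal sketch under assumptions, not a drop-in for the real hierarchy: `Rx`, `RxChar`, `RxSeq` and `RxStar` are hypothetical names, the sequence takes exactly two parts, and each R inside the star is required to consume at least one character so that the recursion always terminates.

```java
// Hypothetical counterpart of the question's Regexp, reduced to the match method.
interface Rx {
    boolean match(String text);
}

// Matches exactly one given character.
final class RxChar implements Rx {
    private final char c;
    RxChar(char c) { this.c = c; }
    public boolean match(String text) {
        return text.length() == 1 && text.charAt(0) == c;
    }
}

// Two-part concatenation that tries every split point, longest first part first.
final class RxSeq implements Rx {
    private final Rx first;
    private final Rx second;
    RxSeq(Rx first, Rx second) { this.first = first; this.second = second; }
    public boolean match(String text) {
        for (int split = text.length(); split >= 0; split--) {
            if (first.match(text.substring(0, split)) && second.match(text.substring(split))) {
                return true;
            }
        }
        return false;
    }
}

// Kleene star: one non-empty R followed by R*, or the empty string.
final class RxStar implements Rx {
    private final Rx inner;
    RxStar(Rx inner) { this.inner = inner; }
    public boolean match(String text) {
        // Zero repetitions: the empty string always matches.
        if (text.isEmpty()) {
            return true;
        }
        // Match one R on a non-empty prefix (longest first), then R* on what is left.
        for (int split = text.length(); split >= 1; split--) {
            if (inner.match(text.substring(0, split)) && match(text.substring(split))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Rx baStar = new RxSeq(new RxChar('b'), new RxStar(new RxChar('a')));      // ba*
        Rx groupStar = new RxStar(new RxSeq(new RxChar('b'), new RxChar('a')));  // (ba)*
        System.out.println(baStar.match("baaa"));
        System.out.println(groupStar.match("baba"));
        System.out.println(groupStar.match("bab"));
    }
}
```

Every split point may be retried at every level of recursion, which is where the exponential worst case comes from.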
Implementing that is an interesting programming exercise, so I encourage you to continue. But be aware that there are better algorithms. You might want to read Russ Cox's interesting essays on regular expression matching.
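To give a flavour of what avoiding backtracking can look like, here is a sketch in which each node maps the set of positions where a match may start to the set of positions where it may end, so every position is considered at most once per step. This is an assumption-laden illustration of the general idea, not the automaton construction from those essays, and `PosNode` and `PosNodes` are hypothetical names.

```java
// Hypothetical node type: starts[i] is true if a match may begin at index i;
// the returned array marks every index where such a match may end.
interface PosNode {
    boolean[] step(String text, boolean[] starts);
}

final class PosNodes {
    private PosNodes() {}

    static PosNode lit(char c) {
        return (text, starts) -> {
            boolean[] ends = new boolean[starts.length];
            for (int i = 0; i < text.length(); i++) {
                if (starts[i] && text.charAt(i) == c) {
                    ends[i + 1] = true;
                }
            }
            return ends;
        };
    }

    // R·S: the ends of R are the starts of S.
    static PosNode concat(PosNode r, PosNode s) {
        return (text, starts) -> s.step(text, r.step(text, starts));
    }

    // R+S: any end reachable through either branch.
    static PosNode union(PosNode r, PosNode s) {
        return (text, starts) -> {
            boolean[] a = r.step(text, starts);
            boolean[] b = s.step(text, starts);
            boolean[] ends = new boolean[starts.length];
            for (int i = 0; i < ends.length; i++) {
                ends[i] = a[i] || b[i];
            }
            return ends;
        };
    }

    // R*: keep applying R to newly reached positions until nothing new appears.
    static PosNode star(PosNode r) {
        return (text, starts) -> {
            boolean[] reached = starts.clone(); // zero repetitions
            boolean[] frontier = starts.clone();
            boolean grew = true;
            while (grew) {
                grew = false;
                boolean[] next = r.step(text, frontier);
                boolean[] fresh = new boolean[next.length];
                for (int i = 0; i < next.length; i++) {
                    if (next[i] && !reached[i]) {
                        reached[i] = true;
                        fresh[i] = true;
                        grew = true;
                    }
                }
                frontier = fresh;
            }
            return reached;
        };
    }

    // Start at index 0 only, and ask whether the end of the text is reachable.
    static boolean fullMatch(PosNode n, String text) {
        boolean[] starts = new boolean[text.length() + 1];
        starts[0] = true;
        return n.step(text, starts)[text.length()];
    }

    public static void main(String[] args) {
        PosNode nested = concat(star(star(lit('a'))), lit('b')); // (a*)*b
        System.out.println(fullMatch(nested, "aaab"));
        System.out.println(fullMatch(nested, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"));
    }
}
```

An expression such as (a*)*b tested against a long run of a's finishes quickly here, because the star loop stops as soon as no new position is reached instead of retrying every way of splitting the text.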