Non-regex alternative to word boundaries

I am currently writing a lexer using regular expressions as described in this post: Poor man's "lexer" for C#

While it was much faster than what I already had, I didn't like that things still took roughly 500ms per file (timed in a loop of 100x36k tokens with Stopwatch).

After rearranging the precedence of my tokens, I had already cut the 500ms in half, and I gained roughly another 50ms by adding a "simple match" boolean to most of my tokens (which basically means the token should be matched with a simple string.Contains(Ordinal) rather than Regex.Match).

For best performance, I obviously want to get rid of most, if not all, Regex.Match calls. For that to be possible, I need something that simulates the \b anchor in Regex, otherwise known as a word boundary (meaning it should only match the whole word).

While I can go wild and write a simple method that checks whether the characters before and after my "simple match" are non-word characters, I was wondering whether .NET has something built in for this?

If I end up having to write my own method, what would be the best approach? Pick the index of the character after my word and check if its byte value is lower than some threshold? Any tips regarding this would also be welcome!

Not sure why my initial question is being downvoted, as to me it seems rather clear. I was not after getting my regexes fixed, as profiling showed that even the simplest Regex still takes more time than I want. It may be a poor man's lexer, but I still want it to perform as well as possible.

The question, however, was whether .NET has a built-in alternative to word boundaries and, if not, how I would go about implementing one myself WITHOUT using Regex.

The answer to the first question appears to be No.

As for the second, I wrote an extension method for the char type:

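// Extension methods must live in a static class; the class name here is arbitrary.
public static class CharExtensions
{
    // True for ASCII letters, ASCII digits and the underscore.
    public static bool IsWordCharacter(this char character)
    {
        return (
            (character >= 'a' && character <= 'z') ||
            (character >= 'A' && character <= 'Z') ||
            (character >= '0' && character <= '9') ||
            character == '_');
    }
}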

According to most Regex documentation, this mimics the \w class (negating this method with ! obviously results in \W), which in turn is what \b is defined in terms of, except that \b does not include the matched character in the result.
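One caveat worth checking: in .NET, \w is Unicode-aware by default and only reduces to [a-zA-Z_0-9] when RegexOptions.ECMAScript is set. The sketch below compares the ASCII range check with both regex modes for a single character. The class and method names (WordCharComparison, Compare, IsAsciiWordCharacter) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class WordCharComparison
{
    // Same ASCII-only test as the extension method in the answer.
    public static bool IsAsciiWordCharacter(char c) =>
        (c >= 'a' && c <= 'z') ||
        (c >= 'A' && c <= 'Z') ||
        (c >= '0' && c <= '9') ||
        c == '_';

    // Returns how the manual check, default \w and ECMAScript \w classify c.
    public static (bool Manual, bool DotNetDefault, bool EcmaScript) Compare(char c)
    {
        string s = c.ToString();
        bool manual = IsAsciiWordCharacter(c);
        bool dotNetDefault = Regex.IsMatch(s, @"^\w$");
        bool ecmaScript = Regex.IsMatch(s, @"^\w$", RegexOptions.ECMAScript);
        return (manual, dotNetDefault, ecmaScript);
    }
}
```

For ASCII input such as 'a', '_' or '-', all three agree. For a letter like '\u00E9' (é), the default \w matches while the manual check and the ECMAScript mode do not, so the range check is only a drop-in replacement if the scripts being lexed use ASCII identifiers.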

I then use this in a method, something like this:

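// needle is the keyword to match; returns its length on a whole-word match, otherwise 0.
return
    text.StartsWith(needle, StringComparison.Ordinal)
    // Guard against indexing past the end when text ends right after the needle.
    && (text.Length == needle.Length || !text[needle.Length].IsWordCharacter())
        ? needle.Length
        : 0;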

After that, my underlying code knows whether it has to use or drop the token.
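If the lexer keeps a cursor into the full source instead of slicing off the remaining text, the same idea can be sketched without StartsWith. This is a minimal sketch, not the code from the answer: the names WordMatcher and MatchWord are hypothetical, and it assumes the needle itself starts and ends with a word character (as keywords do), so a non-word character or the end of the input is required on both sides.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class WordMatcher
{
    private static bool IsWordCharacter(char c) =>
        (c >= 'a' && c <= 'z') ||
        (c >= 'A' && c <= 'Z') ||
        (c >= '0' && c <= '9') ||
        c == '_';

    // Returns needle.Length if needle occurs at text[index] as a whole word, otherwise 0.
    public static int MatchWord(string text, int index, string needle)
    {
        int end = index + needle.Length;
        if (index < 0 || end > text.Length)
            return 0;

        // Ordinal comparison of the candidate span, without allocating a substring.
        if (string.CompareOrdinal(text, index, needle, 0, needle.Length) != 0)
            return 0;

        // A word character directly before or after means we are inside a longer word.
        if (index > 0 && IsWordCharacter(text[index - 1]))
            return 0;
        if (end < text.Length && IsWordCharacter(text[end]))
            return 0;

        return needle.Length;
    }
}
```

A call such as WordMatcher.MatchWord("if (x)", 0, "if") reports a match of length 2, while the same needle inside "iffy" or "elif" is rejected. Whether avoiding the substring is actually faster would have to be confirmed by profiling.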

Disclaimer: I'm aware it's not a full implementation of \b, but it serves my purpose.
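For comparison, a closer approximation of \b is a position test: a boundary exists wherever exactly one of the two neighbouring characters is a word character, with the start and end of the input counting as non-word. The sketch below assumes the same ASCII-only definition of a word character; WordBoundary and IsAt are hypothetical names.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class WordBoundary
{
    private static bool IsWordCharacter(char c) =>
        (c >= 'a' && c <= 'z') ||
        (c >= 'A' && c <= 'Z') ||
        (c >= '0' && c <= '9') ||
        c == '_';

    // True if there is a word boundary between text[position - 1] and text[position].
    // Valid positions run from 0 to text.Length inclusive.
    public static bool IsAt(string text, int position)
    {
        if (position < 0 || position > text.Length)
            throw new ArgumentOutOfRangeException(nameof(position));

        bool wordBefore = position > 0 && IsWordCharacter(text[position - 1]);
        bool wordAfter = position < text.Length && IsWordCharacter(text[position]);

        // Boundary exactly when one side is a word character and the other is not.
        return wordBefore != wordAfter;
    }
}
```

Like \b, this consumes no characters, so it can be combined with any ordinal comparison by testing the positions just before and just after the candidate match.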

Also, after converting all my regexes this way, I went from 250ms to a mere 50ms for exactly the same file. Lexing all 110 script files I have in my possession takes less than a second in total, averaging roughly 7ms per file.
