
When would it be worth using RegEx in Java?

I'm writing a small app that reads some input and does something based on that input.

Currently I'm looking for a line that ends with, say, "magic", so I use String's endsWith method. It's pretty clear to whoever reads my code what's going on.

Another way to do it is to create a Pattern and try to match a line that ends with "magic". This is also clear, but I personally think it is overkill because the pattern I'm looking for is not complex at all.

When do you think it's worth using RegEx in Java? If it's a matter of complexity, how would you personally define what's complex enough?

Also, are there times when using Patterns is actually faster than string manipulation?

EDIT: I'm using Java 6.

Basically: if there is a non-regex operation that does what you want in one step, always go for that.

This is not so much about performance as about a) readability and b) compile-time safety. Specialized non-regex versions are usually a lot easier to read than regex versions. And a typo in a call to one of these specialized methods will not compile, while a typo in a regex will only fail at runtime.
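To illustrate the compile-time-safety point, here is a minimal sketch showing that a malformed regex is just a string to the compiler and is only rejected when Pattern.compile runs. The class and method names (RegexTypoDemo, compiles) are made up for this example.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

final class RegexTypoDemo {
    // Hypothetical helper: reports whether a regex string is syntactically valid.
    // The check can only happen at runtime, because to javac the regex is
    // an ordinary String literal.
    static boolean compiles(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // e.g. an unclosed group such as ".*(magic"
            return false;
        }
    }
}
```

By contrast, misspelling a method name such as endsWith is caught by the compiler before the program ever runs.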

Comparing regex-based solutions to non-regex-based solutions

String s = "Magic_Carpet_Ride";

s.startsWith("Magic");   // non-regex
s.matches("Magic.*");    // regex

s.contains("Carpet");    // non-regex
s.matches(".*Carpet.*"); // regex

s.endsWith("Ride");      // non-regex
s.matches(".*Ride");     // regex

In all these cases it's a no-brainer: use the non-regex version.

But when things get a bit more complicated, it depends. I guess I'd still stick with non-regex in the following case, but many wouldn't:

// Test whether a string ends with "magic" in any case,
// followed by optional white space
s.toLowerCase().trim().endsWith("magic"); // non-regex, 3 calls
s.matches(".*(?i:magic)\\s*");            // regex, 1 call, but ugly
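For a side-by-side comparison, the two variants can be wrapped in named methods, as in the sketch below. The class and method names are invented for this example. It precompiles the pattern because String.matches compiles its regex on every call. The two versions are also not strictly equivalent: by default `.` does not match line terminators, so they can disagree on input containing an embedded newline.

```java
import java.util.regex.Pattern;

final class EndsWithMagic {
    // Compiled once and reused; s.matches(...) would recompile it on each call.
    private static final Pattern MAGIC_AT_END = Pattern.compile(".*(?i:magic)\\s*");

    // Non-regex version from the answer above.
    static boolean viaStringMethods(String s) {
        return s.toLowerCase().trim().endsWith("magic");
    }

    // Regex version from the answer above, using the precompiled pattern.
    static boolean viaRegex(String s) {
        return MAGIC_AT_END.matcher(s).matches();
    }
}
```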

And in response to RegexesCanCertainlyBeEasierToReadThanMultipleFunctionCallsToDoTheSameThing:

I still think the non-regex version is more readable, but I would write it like this:

s.toLowerCase()
 .trim()
 .endsWith("magic");

Makes all the difference, doesn't it?

You would use regex when the normal manipulations on the String class are not enough to elegantly get what you need from the String.

A good indicator that this is the case is when you start splitting, then splitting those results, then splitting those results again. The code gets unwieldy. Two lines of Pattern/regex code can clean this up, neatly wrapped in a method that is unit tested.
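As a rough sketch of that idea, assume a hypothetical input format user@host:port, which would otherwise need a split on ":" followed by a split on "@". The format, class name and method name are all assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EndpointParser {
    // Hypothetical format: user@host:port, e.g. "alice@example.com:8080".
    private static final Pattern ENDPOINT = Pattern.compile("([^@]+)@([^:]+):(\\d+)");

    // Returns {user, host, port}, or null if the input does not have that shape.
    static String[] parse(String s) {
        Matcher m = ENDPOINT.matcher(s);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }
}
```

Because the parsing lives in one small method, a unit test can cover both the well-formed and the malformed cases directly.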

Anything that can be done with regex can also be hand-coded.

Use regex if:

  1. Doing it manually is going to take more effort without much benefit.
  2. You can easily come up with a regex for your task.

Don't use regex if:

  1. It's very easy to do it otherwise, as in your example.
  2. The string you're parsing does not lend itself to regex (it is customary to link to this question).

I think you are best off using endsWith. Unless your requirements change, it's simpler and easier to understand. It might perform faster too.

If there was a bit more complexity, such as wanting to match "magic" or "majik" but not "Magic" or "Majik", or wanting to match "magic" followed by a space and then exactly one word, such as "... magic spoon" but not "... magic soup spoon", then I think RegEx would be a better way to go.
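Both of those requirements might look something like the sketch below. It assumes the line must end right after the match and that a "word" means a run of `\w` characters. The class and method names are invented for the example.

```java
import java.util.regex.Pattern;

final class MagicPatterns {
    // Lower-case "magic" or "majik" at the end of the line (case-sensitive by default).
    private static final Pattern EITHER_SPELLING = Pattern.compile(".*(?:magic|majik)");

    // "magic", one space, then exactly one word up to the end of the line.
    private static final Pattern ONE_WORD_AFTER = Pattern.compile(".*\\bmagic \\w+");

    static boolean endsWithEitherSpelling(String line) {
        return EITHER_SPELLING.matcher(line).matches();
    }

    static boolean magicThenOneWord(String line) {
        return ONE_WORD_AFTER.matcher(line).matches();
    }
}
```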

Any complex parsing where you are generating a lot of objects would be better done with RegEx when you factor in both the computing power and the brainpower it takes to write the code for that purpose. If you have a RegEx guru handy, it's almost always worthwhile, as the patterns can easily be tweaked to accommodate business rule changes without the major loop refactoring that would likely be needed if you used pure Java to do some of the complex things RegEx does.

There's a saying that goes:

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

For a simple test, I'd proceed exactly as you've done. If you find that it's getting more complicated, then I'd consider regular expressions only if there isn't another way.

If your basic line ending is the same every time, such as "magic", then you are better off using endsWith.

However, if you have a line that has the same basic format but can have multiple values, such as:

<string> <number> <string> <string> <number>

where the strings and numbers can be anything, you're better off using RegEx.

Your lines always end with a string, but you don't know what that string is.
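A pattern for the <string> <number> <string> <string> <number> layout could be sketched as follows. It assumes the fields are separated by whitespace, strings contain no whitespace, and numbers are plain non-negative integers. None of that is stated in the answer, and the names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FiveFieldLine {
    // <string> <number> <string> <string> <number>, whitespace-separated (assumed).
    private static final Pattern LINE =
            Pattern.compile("(\\S+)\\s+(\\d+)\\s+(\\S+)\\s+(\\S+)\\s+(\\d+)");

    // Returns the five fields in order, or null if the line has a different shape.
    static String[] fields(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] out = new String[5];
        for (int i = 0; i < out.length; i++) {
            out[i] = m.group(i + 1); // capture groups are numbered from 1
        }
        return out;
    }
}
```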

If it's as simple as endsWith, startsWith or contains, then you should use these functions. If you are processing more "complex" strings and you want to extract information from these strings, then regexp/matchers can be used.

If you have something like "commandToRetrieve someNumericArgs someStringArgs someOptionalArgs" then regexp will ease your task a lot :)
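As a hedged sketch of such a command line, assume one numeric argument, one string argument and at most one optional argument, all separated by whitespace. The grammar and all names here are assumptions, not something the answer specifies. The optional part is expressed as an optional non-capturing group around a capture group.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CommandLineParser {
    // command number string [optional], e.g. "fetch 42 users verbose" (assumed grammar).
    private static final Pattern COMMAND =
            Pattern.compile("(\\w+)\\s+(\\d+)\\s+(\\w+)(?:\\s+(\\w+))?");

    // Returns {command, numericArg, stringArg, optionalArg}; the last entry is
    // null when the optional argument is absent. Returns null on no match.
    static String[] parse(String input) {
        Matcher m = COMMAND.matcher(input);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3), m.group(4) };
    }
}
```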

I'd never use regexes in Java if I have an easier way to do it, like the endsWith method in this case. Regexes in Java are as ugly as they get, with probably the only exception being the matches method on String.

Usually, avoiding regexes makes your code more readable and easier for other programmers. Conversely, complex regexes might confuse even the most experienced hackers out there.

As for performance concerns: just profile. Especially in Java.

I would suggest using a regular expression when you know the format of an input but are not necessarily sure of the value (or possible values) of the formatted input.

What I'm saying is, if all your input ends with, in your case, "magic", then String.endsWith() works fine (since you know that every possible input value will end with "magic").

If you have a format, e.g. the RFC 5322 message format, one cannot say that all email addresses end with .com, hence you can create a regular expression that conforms to the RFC 5322 standard for verification.

In a nutshell, if you know the format structure of your input data but don't know exactly what values (or possible values) you can receive, use regular expressions for validation.

If you are familiar with how regexp works you will soon find that a lot of problems are easily solved by using regexp.

Personally I prefer Java String operations if that is easy, but if you start splitting strings and then taking substrings of the pieces, I'd start thinking in regular expressions.

And again, if you use regular expressions, why stop at lines? By configuring your regexp you can easily read entire files with one regular expression (pass Pattern.DOTALL as a parameter to Pattern.compile and your regexp won't stop at the newlines). I'd combine this with the Apache Commons IOUtils.toString() methods and you've got something very powerful to do quick stuff with.
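A small sketch of what Pattern.DOTALL changes, using a made-up BEGIN/END marker format and hypothetical names. It uses a plain String in place of a file read with IOUtils, so the example has no external dependency.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DotAllDemo {
    // Returns the text between the first BEGIN and the next END, or null if
    // there is no match. The flags are passed straight to Pattern.compile.
    static String between(String text, int flags) {
        Matcher m = Pattern.compile("BEGIN(.*?)END", flags).matcher(text);
        return m.find() ? m.group(1) : null;
    }
}
```

With flags set to 0, `.` stops at line terminators, so a block that spans several lines is not found. With Pattern.DOTALL the same pattern captures across the newlines.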

I would even bring out a regular expression to parse some XML if needed (for instance in a unit test, where I want to check that some elements are present in the XML).

For instance, from a unit test of mine:

Pattern pattern = Pattern.compile(
                "<Monitor caption=\"(.+?)\".*?category=\"(.+?)\".*?>"
                + ".*?<Summary.*?>.+?</Summary>"
                + ".*?<Configuration.*?>(.+?)</Configuration>"
                + ".*?<CfgData.*?>(.+?)</CfgData>", Pattern.DOTALL);

which will match all segments in this XML and pick out some segments that I want to do some sub-matching on.
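How that pattern might be driven is sketched below. The Pattern is copied from the snippet above. The surrounding class, the method name and the shape of the result are assumptions, since the original test code is not shown. Note that with lazy quantifiers and DOTALL, a Monitor element that lacks one of the expected children could cause a match to run on into the next element, so this is only reasonable for input whose structure is known, as in a unit test.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MonitorExtractor {
    // Pattern taken verbatim from the answer above.
    private static final Pattern MONITOR = Pattern.compile(
            "<Monitor caption=\"(.+?)\".*?category=\"(.+?)\".*?>"
            + ".*?<Summary.*?>.+?</Summary>"
            + ".*?<Configuration.*?>(.+?)</Configuration>"
            + ".*?<CfgData.*?>(.+?)</CfgData>", Pattern.DOTALL);

    // Returns one {caption, category, configuration, cfgData} row per match.
    static List<String[]> extract(String xml) {
        List<String[]> rows = new ArrayList<String[]>();
        Matcher m = MONITOR.matcher(xml);
        while (m.find()) { // find() advances to the next non-overlapping match
            rows.add(new String[] { m.group(1), m.group(2), m.group(3), m.group(4) });
        }
        return rows;
    }
}
```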
