Use regex in Java with non-printable characters

I'm using a regex found here ( link ) to extract a domain string, and it works fine.

The regex is:

^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)+[A-Za-z]{2,6}$

I'm wondering how I could change it to match a domain that contains a non-printable character in place of the dot (.).

I know that the regex escapes look like \\x01, \\x02, etc., but if I replace the dot with one of them, the regex no longer matches.

Thanks in advance.

Your dot is escaped here.

You need to remove the double escape ( \\ ) and replace the dot with the literal character you want to match.
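A minimal sketch of that replacement, assuming the separator is the control character U+0001. The class name ControlSeparatorDemo, the helper isDomain, and the sample inputs are made up for illustration; here the escaped dot is swapped for the regex escape \x01 rather than a raw character:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name, helper, and sample inputs are illustrative only.
class ControlSeparatorDemo {
    // Same pattern as in the question, with the escaped dot "\\." replaced by
    // "\\x01", the regex escape for the character U+0001.
    static final Pattern DOMAIN = Pattern.compile(
            "^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\x01)+[A-Za-z]{2,6}$");

    static boolean isDomain(String s) {
        return DOMAIN.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isDomain("google\u0001com"));          // true: U+0001 is the separator
        System.out.println(isDomain("www\u0001google\u0001com")); // true: several labels
        System.out.println(isDomain("google.com"));               // false: a real dot is no longer accepted
    }
}
```

In Java source, "\\x01" hands the regex engine the escape \x01, while "\u0001" embeds the raw character in the pattern. Both should match the same input character, since U+0001 has no special meaning in a regex.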

You could also just remove the double escape and keep the dot, which would then match any character (by default, any character except a line terminator).
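To illustrate how permissive an unescaped dot is, here is a rough sketch. The class name AnySeparatorDemo and the inputs are assumptions for the example; Pattern.DOTALL is the standard flag that lets the dot match line terminators as well:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; names and inputs are illustrative only.
class AnySeparatorDemo {
    // The separator is now an unescaped ".", i.e. "any character".
    static final String REGEX = "^((?!-)[A-Za-z0-9-]{1,63}(?<!-).)+[A-Za-z]{2,6}$";

    static final Pattern ANY = Pattern.compile(REGEX);
    // With DOTALL, "." also matches line terminators such as "\n".
    static final Pattern ANY_DOTALL = Pattern.compile(REGEX, Pattern.DOTALL);

    public static void main(String[] args) {
        System.out.println(ANY.matcher("google\u0001com").matches());    // true: control character as separator
        System.out.println(ANY.matcher("google.com").matches());         // true: ordinary dot still accepted
        System.out.println(ANY.matcher("googlecom").matches());          // true: a letter can act as the separator
        System.out.println(ANY.matcher("google\ncom").matches());        // false: line feed is excluded by default
        System.out.println(ANY_DOTALL.matcher("google\ncom").matches()); // true: DOTALL lifts that restriction
    }
}
```

The third case shows the cost of this shortcut: any run of four or more letters passes, so the pattern no longer says much about where the labels end.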

An unescaped . will match any single character regardless of whether it is printable. Your current character class [A-Za-z0-9-] is what restricts the labels. You could change it to "any character except a literal dot", i.e. [^.].

import java.util.regex.Pattern;

// Labels may contain any character except a literal dot; the separator is still a dot.
Pattern regex = Pattern.compile("^((?!-)[^.]{1,63}(?<!-)\\.)+[A-Za-z]{2,6}$");
System.out.println(regex.matcher("\u0001\u0002\u0003\u0004..com").find()); // => false (empty label between the two dots)
System.out.println(regex.matcher("\u0001\u0002\u0003\u0004.com").find()); // => true
System.out.println(regex.matcher("google.com").find()); // => true

If you're attempting to validate user entry of IDNs (internationalized domain names), note that there are new gTLDs that contain non-ASCII characters, for example .شبكة (.network). The final [A-Za-z]{2,6} part of the pattern rejects such names.
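As a rough sketch of one way to accept such a TLD, the last part of the pattern could use the Unicode letter category \p{L} instead of [A-Za-z]. The class name UnicodeTldDemo and the upper bound of 63 are assumptions made for this example, and this is not a complete IDN validator:

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the relaxed pattern is an assumption, not a full IDN validator.
class UnicodeTldDemo {
    // Original pattern from the question: the TLD must be 2 to 6 ASCII letters.
    static final Pattern ASCII_TLD = Pattern.compile(
            "^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)+[A-Za-z]{2,6}$");

    // Relaxed variant: the TLD may be 2 to 63 letters from any script.
    // The limit of 63 is an assumption borrowed from the label length used above.
    static final Pattern UNICODE_TLD = Pattern.compile(
            "^((?!-)[A-Za-z0-9-]{1,63}(?<!-)\\.)+\\p{L}{2,63}$");

    public static void main(String[] args) {
        // "example." followed by the four Arabic letters of the TLD, written as Unicode escapes.
        String arabic = "example.\u0634\u0628\u0643\u0629";
        System.out.println(ASCII_TLD.matcher(arabic).matches());         // false: not ASCII letters
        System.out.println(UNICODE_TLD.matcher(arabic).matches());       // true: letters in any script
        System.out.println(UNICODE_TLD.matcher("google.com").matches()); // true: ASCII TLDs still pass
    }
}
```

Only the TLD is relaxed here; the labels before it are still limited to ASCII letters, digits, and hyphens.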

 