How to check if a string contains only a specified character set?

I'm working with strings and I wonder what the best way is to check whether a string contains only characters from a specified character set:

@  ∆  SP  0  ¡  P  ¿  p 
£  _  !  1  A  Q  a  q 
$  Φ  "  2  B  R  b  r 
¥  Γ  #  3  C  S  c  s 
è  Λ  ¤  4  D  T  d  t 
é  O  %  5  E  U  e  u 
ù  Π  &  6  F  V  f  v 
ì  Ψ  '  7  G  W  g  w 
ò  Σ  (  8  H  X  h  x 
Ç  Θ  )  9  I  Y  i  y 
LF  Ξ  *  :  J  Z  j  z 
Ø  1)  +  ;  K  Ä  k  ä 
ø  Æ  ,  <  L  Ö  l  ö 
CR  æ  q  =  M  Ñ  m  ñ 
Å  ß  .  >  N  Ü  n  ü 
å  É  /  ?  O  §  o  à 

I tried to do this with eregi and regular expressions, but didn't succeed. Another way is to convert each char to its decimal code and check whether it is smaller than 137, or to check each character with in_array(), which I find weak.

Does anyone have a better solution?

Thanks in advance.

I see you've already accepted another answer, but I want to explain why your attempts with regex weren't working. Hopefully it'll help you.

Firstly, regarding the tags on this question: please note that PHP's ereg_ functions have been deprecated (and were removed entirely in PHP 7.0); you should only use the preg_ functions.

Now, if you want to use regex for this sort of thing, you would typically use a negated character class to define the list of characters you want to allow, and then look for anything else.

A character class is a list of characters enclosed in square brackets. You can negate a character class by adding a caret symbol (^) to the start of it. So if you wanted a string that contained only 'A', 'B' or 'C', and you wanted to be warned about strings which contained anything else, you could use something like this:

$result = preg_match("/[^ABC]/",$mystring);

Your example is basically the same (but with more characters to test, obviously), except for two points: firstly, you have characters in your list which are reserved characters in regex, and secondly, you are using non-ASCII characters.

The regex reserved characters can be dealt with by escaping them with a leading backslash. You just need to know which characters are reserved. Looking at your list, I see ?, /, . and +.
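As a rough sketch of how the escaping could be automated rather than done by hand: PHP's preg_quote() escapes regex metacharacters, and its second argument also escapes the pattern delimiter. The helper name buildDisallowedPattern and the sample strings are made up for illustration, and the helper assumes a non-empty list of allowed characters.

```php
<?php
// Hypothetical helper: turns a non-empty string of allowed characters
// into a pattern that matches any character NOT in that string.
function buildDisallowedPattern(string $allowed): string
{
    // preg_quote() escapes regex metacharacters (including ] ^ - and \),
    // and the second argument makes it escape the "/" delimiter as well.
    return '/[^' . preg_quote($allowed, '/') . ']/u';
}

$pattern = buildDisallowedPattern('?/.+-]^\\ABC');

// preg_match() returns 0 when nothing outside the set is found.
var_dump(preg_match($pattern, 'A+B/C?') === 0); // bool(true)
// "D" is not in the allowed list, so the pattern matches once.
var_dump(preg_match($pattern, 'A+B/D') === 1);  // bool(true)
```

Escaping a character that did not strictly need it inside a class is harmless, which is why quoting the whole list is a reasonable default here.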

The second point explains why you couldn't get it working with ereg: the ereg functions don't support Unicode. Switch to using the preg functions instead, and you'll have more luck.

You still need to tell the regex engine that you're looking for Unicode characters. This is done by adding the u modifier to the end of the regex string.

So a shortened version of your query might look like this:

$result = preg_match("/[^èΛ¤4DTdt]/u",$mystring);
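A minimal sketch of how the shortened pattern above might be wrapped into a yes/no check. The function name containsOnlyAllowed is an assumption for illustration, and the sketch assumes both the source file and the subject string are UTF-8 encoded.

```php
<?php
// Hypothetical wrapper around the shortened pattern from the answer.
function containsOnlyAllowed(string $subject): bool
{
    // With the u modifier, pattern and subject are treated as UTF-8,
    // so "è" is matched as one character rather than as two bytes.
    $result = preg_match('/[^èΛ¤4DTdt]/u', $subject);

    if ($result === false) {
        // preg_match() returns false on error, e.g. invalid UTF-8 input.
        throw new InvalidArgumentException('Subject could not be matched');
    }

    // 0 means no character outside the allowed set was found.
    return $result === 0;
}

var_dump(containsOnlyAllowed('dΛtè4')); // bool(true)
var_dump(containsOnlyAllowed('dΛtèX')); // bool(false)
```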

It looks like you're including new lines in your list of characters, so add them to the class as \n and \r. Note that the multi-line modifier m only changes how the ^ and $ anchors match, so this pattern does not need it alongside the u.

For characters which can't be typed (or indeed for any character, if it's easier), you can use escape sequences for their Unicode code points. In PHP's preg functions, use \x{FFFF}, where FFFF is the hex Unicode code point of the character you want to match; e.g. \x{00E0} matches à.
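To illustrate the escape-sequence idea, here is a small sketch using the \x{...} code point syntax that PHP's PCRE functions accept together with the u modifier. The function name hasDisallowedChar and the tiny allowed set (à, è, LF, CR and A to Z) are assumptions chosen for the example, not the full table from the question.

```php
<?php
// Hypothetical check against a small allowed set written with escapes.
function hasDisallowedChar(string $subject): bool
{
    // \x{00E0} is à and \x{00E8} is è; \n and \r cover LF and CR.
    // Single quotes keep the backslashes intact for the regex engine.
    $pattern = '/[^\x{00E0}\x{00E8}\n\rA-Z]/u';

    return preg_match($pattern, $subject) === 1;
}

var_dump(hasDisallowedChar("Aà\nBè")); // bool(false)
var_dump(hasDisallowedChar("Aé"));     // bool(true), é is U+00E9
```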

I hope that gives you a better insight into regular expressions. I should add that I'm not saying that regex is necessarily the best solution to this question, nor necessarily the only solution. I have tried to make it perform well by using the negated character class (which means the search stops as soon as it finds a character outside the allowed set, and should prevent the kind of excessive backtracking that can sometimes make regular expressions quite slow), so it should be reasonably performant, but I haven't tested it against other solutions.

I hope that helps.

As far as single-byte charsets are concerned, you can do it with a string function:

$charset = 'abc';
$test = 'abcd';
$ofCharset = strlen($test) === strspn($test, $charset); # FALSE
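The single-byte restriction matters in practice. The sketch below wraps the strspn() comparison in a helper (the name consistsOf is made up) and shows one way it can go wrong with UTF-8 input, assuming the source file is saved as UTF-8.

```php
<?php
// Hypothetical helper; only reliable for single-byte charsets,
// because strspn() and strlen() both count bytes, not characters.
function consistsOf(string $subject, string $charset): bool
{
    return strspn($subject, $charset) === strlen($subject);
}

var_dump(consistsOf('abcabc', 'abc')); // bool(true)
var_dump(consistsOf('abcd', 'abc'));   // bool(false)

// Multi-byte pitfall: in UTF-8, "é" is C3 A9 and "¢" is C2 A2, so the
// byte mask contains C3 and A2, which together spell "â" (C3 A2).
var_dump(consistsOf('â', 'é¢'));       // bool(true), although "â" is not in the set
```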

Otherwise you must split your string into an array with one character per entry and then compare each entry against a character table, which could be a keyed array containing the characters of the charset as keys.
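One possible way to do the split-and-compare for UTF-8 strings is sketched below. It uses preg_split() with an empty pattern and the u modifier to get one array entry per character; the helper name onlyFromTable and the short sample table are assumptions for illustration.

```php
<?php
// Hypothetical helper: checks a UTF-8 string against a keyed lookup table.
function onlyFromTable(string $subject, array $table): bool
{
    // The empty pattern with the u modifier splits between UTF-8
    // characters; PREG_SPLIT_NO_EMPTY drops the empty edge pieces.
    $chars = preg_split('//u', $subject, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false; // the subject was not valid UTF-8
    }

    foreach ($chars as $char) {
        if (!isset($table[$char])) {
            return false; // first character outside the table
        }
    }
    return true;
}

// array_fill_keys() makes each allowed character a key of the table.
$table = array_fill_keys(['è', 'Λ', '¤', '4', 'D', 'T', 'd', 't'], true);

var_dump(onlyFromTable('dΛtè4', $table)); // bool(true)
var_dump(onlyFromTable('dΛtèX', $table)); // bool(false)
```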

To keep the operation O(n), you could compute the ASCII value of each of your test characters and place them into a hash table like so:

$testChars[$ascii] = true;

Then just loop through the subject string's characters and test whether the hash table entry is set and equals true. If you get false for any of the characters, then the string contains characters not in your test set.

This would be better than using in_array because testing whether $testChars[$ascii] == true is a constant-time O(1) lookup.
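The hash table idea could be sketched roughly as follows. Both function names are hypothetical, and because ord() only looks at a single byte, this version is meant for single-byte (ASCII) input only.

```php
<?php
// Hypothetical helper: builds the lookup table keyed by byte value.
function buildByteTable(string $allowed): array
{
    $table = [];
    for ($i = 0, $n = strlen($allowed); $i < $n; $i++) {
        $table[ord($allowed[$i])] = true;
    }
    return $table;
}

// Hypothetical helper: one pass over the subject, O(1) work per byte.
function onlyAllowedBytes(string $subject, array $table): bool
{
    for ($i = 0, $n = strlen($subject); $i < $n; $i++) {
        if (!isset($table[ord($subject[$i])])) {
            return false; // the first byte outside the set ends the scan
        }
    }
    return true;
}

$testChars = buildByteTable('abc');

var_dump(onlyAllowedBytes('abcabc', $testChars)); // bool(true)
var_dump(onlyAllowedBytes('abcd', $testChars));   // bool(false)
```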

I know this is an old question, but no one has mentioned strpbrk. I've never tried it with odd characters, but aside from that possibly being an issue, why wouldn't this work?
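For reference, a small sketch of what strpbrk() reports. It searches for the first occurrence of any character from a list, so it fits most naturally when the list holds the characters you want to reject rather than the ones you want to allow. The helper name containsAnyOf and the sample strings are made up, and the character list is assumed to be non-empty.

```php
<?php
// strpbrk() returns the rest of the string starting at the first
// character found in the list, or false if none of them occurs.
function containsAnyOf(string $subject, string $charList): bool
{
    return strpbrk($subject, $charList) !== false;
}

var_dump(containsAnyOf('macguffin', 'xyz')); // bool(false)
var_dump(containsAnyOf('macguffin', 'fz'));  // bool(true)
```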

Here's a great resource that might help you find your answer.

Advanced Regular Expression Tips and Techniques

If you're only trying to find out whether there are other characters, you could just str_replace the character set with "" and then get the strlen. If it is 0, then only those characters are there; if it is greater than 0, then other characters exist.

Example:

$mystr = "macguffin";
$mycharset = array('m', 'a', 'c', 'g', 'u', 'f', 'i', 'n');

$tmpstr = str_replace($mycharset, "", $mystr);

if (!strlen($tmpstr)) {
    echo "only charset chars";
} else {
    echo "other chars";
}

would print

only charset chars

but

$mystr = "macguffin";
$mycharset = array('m', 'a', 'c');

$tmpstr = str_replace($mycharset, "", $mystr);

if (!strlen($tmpstr)) {
    echo "only charset chars";
} else {
    echo "other chars";
}

would print

other chars

HTH
