How to check if a string contains only a specified character set?

Question

I'm working with strings and I wonder what the best way is to check whether a string contains only characters from the following set:

@  ∆  SP  0  ¡  P  ¿  p 
£  _  !  1  A  Q  a  q 
$  Φ  "  2  B  R  b  r 
¥  Γ  #  3  C  S  c  s 
è  Λ  ¤  4  D  T  d  t 
é  Ω  %  5  E  U  e  u 
ù  Π  &  6  F  V  f  v 
ì  Ψ  '  7  G  W  g  w 
ò  Σ  (  8  H  X  h  x 
Ç  Θ  )  9  I  Y  i  y 
LF  Ξ  *  :  J  Z  j  z 
Ø  1)  +  ;  K  Ä  k  ä 
ø  Æ  ,  <  L  Ö  l  ö 
CR  æ  -  =  M  Ñ  m  ñ 
Å  ß  .  >  N  Ü  n  ü 
å  É  /  ?  O  §  o  à

I tried to do it with eregi and regular expressions, but didn't succeed. Another way would be to convert each char to its decimal code and check whether it is smaller than 137, or to check each character with in_array(), which I find weak.

Does anyone have a better solution?

Thanks in advance.

Answer 1

I see you've already accepted another answer, but I want to explain why your attempts with regex weren't working. Hopefully it'll help you.

Firstly, I notice ereg in your tags for this question. Please note that PHP's ereg_ functions have been deprecated (and were removed entirely in PHP 7); you should only use the preg_ functions.

Now, if you want to use regex for this sort of thing, you would typically use a negated character class to define a list of characters you want to allow, and then look for anything else.

A character class is a list of characters enclosed in square brackets. You can negate a character class by adding a caret symbol (^) to the start of it. So if you wanted a string that contained only 'A', 'B' or 'C', and you wanted to get warned about strings which contained anything else, you could use something like this:

$result = preg_match("/[^ABC]/",$mystring);

Your example is basically the same (but with more characters to test, obviously), except for two points: firstly, you have characters in your list which are reserved characters in regex, and secondly, you are using non-ASCII characters.

The regex reserved characters can be dealt with by escaping them with a leading backslash. You just need to know which characters are reserved. Inside a character class that list is short: the backslash, ] , ^ and - , plus / because it is the pattern delimiter here. Characters such as ? , . and + are literal inside a class, although escaping them does no harm.

The second point explains why you couldn't get it working with ereg: the ereg functions don't support Unicode. Switch to using the preg functions instead, and you'll have more luck.

You still need to tell the regex engine that you're matching Unicode characters. This is done by adding the u modifier to the end of the regex string. The pattern and the subject string must then both be valid UTF-8.

So a shortened version of your query might look like this:

$result = preg_match("/[^èΛ¤4DTdt]/u",$mystring);
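If you would rather not escape the reserved characters by hand, the pattern can be built from the list of allowed characters. The following is only a sketch: containsOnly() is a made-up helper name, and it assumes that both strings are valid UTF-8 and that the allowed list is not empty.

```php
<?php
// Hypothetical helper (not a built-in): true when $subject consists only of
// characters that appear in $allowed. Assumes UTF-8 input and a non-empty $allowed.
function containsOnly(string $subject, string $allowed): bool
{
    // preg_quote() escapes regex metacharacters, including ], ^, - and the
    // backslash; its second argument additionally escapes the "/" delimiter.
    $class = preg_quote($allowed, '/');

    // The negated class matches the first character that is NOT allowed.
    // preg_match() returns 1 on a match, 0 on no match and false on error
    // (for example invalid UTF-8), so only a strict 0 means "all allowed".
    return preg_match('/[^' . $class . ']/u', $subject) === 0;
}

var_dump(containsOnly('dtD4', 'èΛ¤4DTdt')); // all four characters are in the list
var_dump(containsOnly('dtX', 'èΛ¤4DTdt'));  // "X" is not in the list
```

Comparing against 0 with === is deliberate: an error result (false) is then treated as "not valid" rather than silently passing.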

It looks like you're including newlines (LF and CR) in your list of characters. You don't need the multi-line modifier m for that, since it only changes where the ^ and $ anchors match; just add \n and \r to the character class.

For characters which can't be typed (or indeed for any character, if it's easier), you can use escape sequences with their Unicode code points. Use \x{FFFF} , where FFFF is the hex code point of the character you want to match -- eg \x{E0} matches à . Write the pattern in single quotes so that PHP passes the backslash through to the regex engine unchanged.

I hope that gives you a better insight into regular expressions. I should add that I'm not saying that regex is necessarily the best solution to this question, nor necessarily the only solution. I have tried to make it perform well by using the negated character class (which means the search stops as soon as it finds a character outside the list, and should prevent the kind of excessive backtracking which can make regular expressions quite slow sometimes), so it should be reasonably performant, but I haven't tested it against other solutions.
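As a small illustration of code point escapes in a preg pattern with the u modifier, here is a sketch using PCRE's \x{...} syntax. The sample subject string is made up, and the "\u{E0}" string escape needs PHP 7 or later.

```php
<?php
// Single quotes keep the backslashes intact for the regex engine.
// \x{E0} is "à", \x{0A} is LF and \x{0D} is CR.
$pattern = '/[^a-z\x{E0}\x{0A}\x{0D}]/u';

// "\u{E0}" is PHP's own escape for the same character, "à".
$subject = "voil\u{E0}\n";

// 1 would mean a character outside the class was found; here none is.
$hasOther = preg_match($pattern, $subject) === 1;
var_dump($hasOther);
```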

I hope that helps.

Answer 2

As long as you are dealing with single-byte charsets, you can do it with a string function:

$charset = 'abc';
$test = 'abcd';
$ofCharset = strlen($test) === strspn($test, $charset); # FALSE

Otherwise you must split your string into an array with one character per entry and compare each entry against a character table, which could be a keyed array with the characters of the charset as keys.
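The multi-byte variant described above might look roughly like this. It is a sketch only: onlyFromCharset() is a hypothetical name, the input is assumed to be UTF-8, and preg_split() with the u modifier is just one way of splitting a string into characters.

```php
<?php
// Hypothetical helper: true when every UTF-8 character of $test is in $charset.
function onlyFromCharset(string $test, string $charset): bool
{
    // An empty pattern with the u modifier splits between UTF-8 characters;
    // preg_split() returns false if the input is not valid UTF-8.
    $allowedChars = preg_split('//u', $charset, -1, PREG_SPLIT_NO_EMPTY);
    $testChars    = preg_split('//u', $test, -1, PREG_SPLIT_NO_EMPTY);
    if ($allowedChars === false || $testChars === false) {
        return false;
    }

    // Keyed array: each allowed character becomes a key for a fast isset() lookup.
    $table = array_flip($allowedChars);

    foreach ($testChars as $char) {
        if (!isset($table[$char])) {
            return false; // first character outside the charset
        }
    }
    return true;
}

var_dump(onlyFromCharset('àbc', 'abcà'));
var_dump(onlyFromCharset('abcd', 'abc'));
```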

Answer 3

To keep the operation O(n), you could compute the ASCII value of each of your test characters and place them into a hash table like so:

$testChars[$ascii] = true;

Then just loop through the subject string's characters and test if the hash table value entry is set and equates to true. If you get false for any of the characters then it contains characters not in your test set.

This would be better than using in_array because testing if $testChars[$ascii] == true is a constant O(1) lookup.
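Put together, the idea could be sketched as follows. The function name onlyAllowedBytes() is made up, and the code works on single bytes via ord(), so it is only suitable for single-byte character sets.

```php
<?php
// Hypothetical helper: byte-wise check against a lookup table.
// Only suitable for single-byte character sets, not for UTF-8 text.
function onlyAllowedBytes(string $subject, string $allowedChars): bool
{
    // Build the hash table once: byte value => true.
    $testChars = [];
    for ($i = 0, $n = strlen($allowedChars); $i < $n; $i++) {
        $testChars[ord($allowedChars[$i])] = true;
    }

    // One pass over the subject, one constant-time lookup per byte.
    for ($i = 0, $n = strlen($subject); $i < $n; $i++) {
        if (!isset($testChars[ord($subject[$i])])) {
            return false;
        }
    }
    return true;
}

var_dump(onlyAllowedBytes('abba', 'abc'));
var_dump(onlyAllowedBytes('abcd', 'abc'));
```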

Answer 4

I know this is an old question, but no one has mentioned strpbrk . I've never tried it with odd characters, but aside from that possibly being an issue, why wouldn't this work?
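For reference, strpbrk takes the list of characters to look for, so it fits most naturally when the forbidden characters are known; for an allowed list you would have to pass it the complement of the set. A minimal sketch, with a made-up helper name and a made-up forbidden list, comparing bytes rather than multi-byte characters:

```php
<?php
// Hypothetical helper: true when $subject contains none of the bytes in $forbidden.
// strpbrk() returns the rest of the string from the first hit, or false if none.
function containsNoneOf(string $subject, string $forbidden): bool
{
    return strpbrk($subject, $forbidden) === false;
}

var_dump(containsNoneOf('hello', '<>&'));
var_dump(containsNoneOf('a<b', '<>&'));
```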

Answer 5

Here's a great resource that might help you find your answer.

Advanced Regular Expression Tips and Techniques

Answer 6

If you're only trying to find out whether there are other characters, you could just str_replace the character set with "" and then get the strlen ... If it is 0, then only those characters are there; if it is greater than 0, other characters exist.

ex.

$mystr = "macguffin";
$mycharset = array('m', 'a', 'c', 'g', 'u', 'f', 'i', 'n');

$tmpstr = str_replace($mycharset, "", $mystr);

if (!strlen($tmpstr)) {
    echo "only charset chars";
} else {
    echo "other chars";
}

would return

only charset chars

but

$mystr = "macguffin";
$mycharset = array('m', 'a', 'c');

$tmpstr = str_replace($mycharset, "", $mystr);

if (!strlen($tmpstr)) {
    echo "only charset chars";
} else {
    echo "other chars";
}

would return

other chars

HTH

6 answers

solution1
9 ACCEPTED 2011-07-08 09:35:31

solution2
3 2011-07-06 19:16:43

solution3
1 2011-07-06 18:53:56

solution4
0 2013-11-11 02:56:00

solution5
0 2011-07-06 18:42:30

solution6
0 2011-07-06 20:13:34
