
How to detect if a String has specific UTF-8 characters in it? (Python)

I have a list of strings in Python. I want to remove every string that contains "special" UTF-8 characters and keep only the strings made up entirely of characters from U+0021 to U+00FF. Is there a way to detect whether a string contains only characters in that range?

Thanks :)

EDIT: I use Python 3

>>> all_strings = ["okstring", "bađštring", "goodstring"]
>>> acceptable = set(chr(i) for i in range(0x21, 0xFF + 1))
>>> simple_strings = filter(lambda s: set(s).issubset(acceptable), all_strings)
>>> list(simple_strings)
['okstring', 'goodstring']

You can use a regular expression.

import re
mylist = ['str1', 'štr2', 'str3']
regexp = re.compile(r'[^\u0021-\u00FF]')
good_strs = filter(lambda s: not regexp.search(s), mylist)

[^\u0021-\u00FF] defines a negated character set: it matches any single character that is not in the range from U+0021 (!) to U+00FF (ÿ). The letter r before '[^\u0021-\u00FF]' indicates raw string notation, which saves you a lot of backslash escaping. Without it, a backslash intended for the regular expression engine generally has to be written as '\\'.

regexp.search(s) scans through s looking for the first location where the pattern [^\u0021-\u00FF] matches and returns the corresponding match object. It returns None if no match is found.

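filter() then keeps only the strings in which no out-of-range character was found. In Python 3 it returns an iterator, so wrap it in list() if you need a list.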

This answer is only valid for Python 3
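
As a rough, self-contained sketch of the regular-expression approach above: the names OUT_OF_RANGE and only_in_range are made up for illustration and are not part of the original answer. Note that the space character (U+0020) is below the requested range, so a string containing a space is rejected.

```python
import re

# Matches any single character outside U+0021..U+00FF.
# (OUT_OF_RANGE and only_in_range are illustrative names.)
OUT_OF_RANGE = re.compile(r'[^\u0021-\u00FF]')

def only_in_range(s):
    """Return True if every character of s lies in U+0021..U+00FF."""
    return OUT_OF_RANGE.search(s) is None

mylist = ['str1', 'štr2', 'str3']
# A list comprehension gives a list directly instead of a filter iterator.
good_strs = [s for s in mylist if only_in_range(s)]
print(good_strs)
```

With the sample list, only the string containing 'š' (U+0161) should be dropped.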

What exactly do you mean by "special utf-8 characters"?

If you mean every non-ASCII character, then you can try:

s.encode('ascii', 'strict')

It will raise a UnicodeEncodeError if the string is not 100% ASCII.
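
A minimal sketch of how that call could be turned into a yes/no test, assuming Python 3; the helper name is_ascii is hypothetical. On Python 3.7 and later, the built-in str.isascii() method should give the same answer without the try/except.

```python
def is_ascii(s):
    """Return True if s can be encoded as ASCII (illustrative helper)."""
    try:
        s.encode('ascii', 'strict')
    except UnicodeEncodeError:
        # At least one character is outside U+0000..U+007F.
        return False
    return True

all_strings = ["okstring", "bađštring", "goodstring"]
ascii_only = [s for s in all_strings if is_ascii(s)]
print(ascii_only)
```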

The latin1 encoding corresponds to the first 256 Unicode code points. Put differently, if c is a Unicode character with a code point in [0-255], c.encode('latin1') is a single byte whose value equals ord(c).

So to test whether a string has at least one character outside the [0-255] range, just try to encode it as latin1. If every character is in range, the encoding will succeed; otherwise you will get a UnicodeEncodeError:

no_special = True
try:
    s.encode('latin1')
except UnicodeEncodeError:
    no_special = False

BTW, as you were told in a comment, Unicode characters outside the [0-255] range are not special; they are simply not in the latin1 range.

Please note that the above also accepts all control characters such as \t, \r or \n, because they are legal latin1 characters. This may or may not be what you want here.
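
If the control characters (and the space, U+0020) should be rejected too, as the U+0021 lower bound in the question suggests, the latin1 test can be combined with a check on the encoded bytes. This is only a sketch under that assumption, and in_question_range is a made-up name:

```python
def in_question_range(s):
    """Return True if every character of s is in U+0021..U+00FF."""
    try:
        encoded = s.encode('latin1')
    except UnicodeEncodeError:
        # Some character has a code point above U+00FF.
        return False
    # latin1 maps each character to one byte equal to its code point,
    # so the lower bound can be checked directly on the bytes.
    return all(b >= 0x21 for b in encoded)

all_strings = ["okstring", "bađštring", "goodstring", "tab\there"]
kept = [s for s in all_strings if in_question_range(s)]
print(kept)
```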
