I have a list of strings in Python. I want to remove from the list every string that contains "special" UTF-8 characters, keeping only the strings made up entirely of characters from "U+0021" to "U+00FF". Do you know a way to detect whether a string contains only characters in that range?
Thanks :)
EDIT: I use Python 3
>>> all_strings = ["okstring", "bađštring", "goodstring"]
>>> acceptable = set(chr(i) for i in range(0x21, 0xFF + 1))
>>> simple_strings = filter(lambda s: set(s).issubset(acceptable), all_strings)
>>> list(simple_strings)
['okstring', 'goodstring']
You can use a regular expression.
import re
mylist = ['str1', 'štr2', 'str3']
regexp = re.compile(r'[^\u0021-\u00FF]')
good_strs = filter(lambda s: not regexp.search(s), mylist)
[^\u0021-\u00FF] defines a negated character set: it matches any single character that is not in the range from U+0021 (!) to U+00FF (ÿ).
The letter r before the pattern string indicates raw string notation, which saves you a lot of backslash escaping. Without it, every backslash in a regular expression would have to be prefixed with another one to escape it.
regexp.search(s) scans through s looking for the first location where the compiled pattern produces a match and returns a corresponding match object, or None if no match is found.
filter() then keeps only the strings in which no such character was found. In Python 3 it returns an iterator, so wrap it in list() if you need a list.
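As a minimal sketch of the approach above, the check can be wrapped in a small helper. The names only_allowed and OUTSIDE_RANGE are made up for illustration and are not part of the original answer.

```python
import re

# Matches any single character outside U+0021..U+00FF.
OUTSIDE_RANGE = re.compile(r'[^\u0021-\u00FF]')

def only_allowed(s):
    """Return True if every character of s lies in U+0021..U+00FF."""
    return OUTSIDE_RANGE.search(s) is None

mylist = ['str1', 'štr2', 'str3']
# Keep the strings in which no out-of-range character was found.
good_strs = [s for s in mylist if only_allowed(s)]
print(good_strs)  # ['str1', 'str3']
```

Note that a space (U+0020) is below the lower bound, so a string such as 'a b' is rejected by this check.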
This answer is only valid for Python 3
What do you mean exactly by "special utf-8 characters" ?
If you mean every non-ascii character, then you can try:
s.encode('ascii', 'strict')
It will raise a UnicodeEncodeError if the string is not 100% ASCII.
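A rough sketch of how that call might be turned into a yes/no test. The helper name is_ascii is hypothetical, and 'strict' is left out because it is already the default error handler.

```python
def is_ascii(s):
    """Return True if s can be encoded as ASCII, i.e. has no character above U+007F."""
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True

print(is_ascii('str1'))  # True
print(is_ascii('štr2'))  # False
```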
The latin1 encoding corresponds to the first 256 Unicode code points. Said differently, if c is a Unicode character with a code point in [0-255], c.encode('latin1') is the single byte whose value is ord(c).
So to test whether a string has at least one character outside the [0-255] range, just try to encode it as latin1. If it contains none, the encoding will succeed; otherwise you will get a UnicodeEncodeError:
no_special = True
try:
    s.encode('latin1')
except UnicodeEncodeError:
    no_special = False
BTW, as you were told in a comment, Unicode characters outside the [0-255] range are not special; they are simply not in the latin1 range.
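Please note that the above also accepts all control characters like \t, \r or \n, as well as the space character (U+0020), because they are legal latin1 characters even though they fall below the U+0021 lower bound from the question. It may or may not be what you want here.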
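If the lower bound matters, one possible alternative is to compare code points directly instead of encoding. This is only a sketch: the function name in_range and its low/high parameters are assumptions for illustration, with defaults taken from the range in the question.

```python
def in_range(s, low=0x21, high=0xFF):
    """Return True if every character of s has a code point in [low, high]."""
    return all(low <= ord(c) <= high for c in s)

all_strings = ["okstring", "bađštring", "goodstring", "tab\there"]
print([s for s in all_strings if in_range(s)])  # ['okstring', 'goodstring']
```

The range U+0021 to U+00FF still contains the control characters U+007F to U+009F, so a stricter filter would need to exclude those separately.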