Extract Only Unicode Characters from a String using Regular Expressions
I want to extract Unicode characters from a string using regular expressions, removing ASCII letters, numbers and special symbols from a string or a text file. Is this possible with a regular expression? For instance, I want only the Hindi or Chinese characters from text taken from a news article.
As stated above, ASCII is a subset of Unicode, so the question doesn't quite make sense as-is. If you really want to remove all codepoints below U+0080 from the string, that's easy:
re.sub(r"[\x00-\x7f]+", "", mystring)
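For a self-contained illustration of that one-liner, here is a minimal Python 3 sketch. The helper name `strip_ascii` and the sample sentence are assumptions made for the example, not part of the original answer.

```python
import re

# Matches one or more codepoints in the ASCII range U+0000..U+007F.
ASCII_RUN = re.compile(r"[\x00-\x7f]+")


def strip_ascii(text):
    """Return text with every ASCII codepoint removed (hypothetical helper)."""
    return ASCII_RUN.sub("", text)


# Hypothetical sample: English words, digits and punctuation around one Hindi word.
sample = "Price: 100 रुपये (approx.)"
print(strip_ascii(sample))
```

Note that spaces and line breaks are ASCII too, so the surviving words end up run together; substituting a single space instead of the empty string may be preferable if word boundaries matter.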
If you want to keep only certain "whitelisted" characters, you need to specify precisely which codepoints to keep.
For example, to keep Devanagari codepoints (used for writing Hindi), you can use
re.sub(r"[^\u0900-\u097F]+", "", mystring)
or, in Python 2, with a Unicode raw string literal (thanks @bobince for the heads-up!):
re.sub(ur"[^\u0900-\u097F]+", "", mystring)
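Since the question asks about extracting rather than deleting, the same whitelist idea can also be turned around with `findall`. The following Python 3 sketch assumes two ranges, the Devanagari block (U+0900..U+097F) and the basic CJK Unified Ideographs block (U+4E00..U+9FFF); the helper name `extract_runs` and the sample text are hypothetical, and other blocks (CJK extensions, for example) would have to be added by hand.

```python
import re

# Assumed whitelist: the Devanagari block plus the basic CJK Unified Ideographs block.
WANTED_RUN = re.compile(r"[\u0900-\u097F\u4E00-\u9FFF]+")


def extract_runs(text):
    """Return each maximal run of whitelisted codepoints, in order of appearance."""
    return WANTED_RUN.findall(text)


# Hypothetical sample mixing English, digits, Hindi and Chinese.
sample = "Hello नमस्ते 123, 中文 news!"
print(extract_runs(sample))
```

Returning a list of runs keeps the original word boundaries, which a plain substitution with the empty string would lose.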
You do need to make sure that you're working on a Unicode string, so don't forget to decode/encode your input string:
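import re
import urllib2  # Python 2 only

url = 'http://www.bhaskar.com/'
# Decode the downloaded bytes to a Unicode string; "utf-8-sig" also drops a leading BOM.
data = urllib2.urlopen(url).read().decode("utf-8-sig")
regex = re.compile(ur"[^\u0900-\u097F]+")
# Replace every run of non-Devanagari codepoints with the empty string.
hindionly = regex.sub(u"", data)
print hindionly.encode("utf-8")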
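The decode, filter, encode sequence can also be sketched in Python 3 without any network access. Here the downloaded page is stood in for by a hypothetical byte string that starts with a UTF-8 byte order mark; the helper name `devanagari_from_bytes` is an assumption made for the example.

```python
import codecs
import re

# Matches runs of anything outside the Devanagari block.
NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F]+")


def devanagari_from_bytes(raw, encoding="utf-8-sig"):
    """Decode raw bytes to str, then drop everything outside the Devanagari block."""
    text = raw.decode(encoding)  # "utf-8-sig" strips a leading BOM if one is present
    return NON_DEVANAGARI.sub("", text)


# Hypothetical stand-in for bytes read from a file or an HTTP response.
raw = codecs.BOM_UTF8 + "Title: हिंदी समाचार".encode("utf-8")
result = devanagari_from_bytes(raw)
print(result)                  # a str containing only Devanagari codepoints
print(result.encode("utf-8"))  # back to bytes for writing out
```

In Python 3 the `str` type is already Unicode, so the only explicit steps are decoding on the way in and encoding on the way out.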
Using the third-party regex module, you could express the pattern using Unicode scripts:
import regex
print(repr(regex.sub(ur'[^\p{Devanagari}\p{Han}]', u'', u'abc123\u0900')))
# u'\u0900'
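If installing a third-party module is not an option, a rough standard-library approximation is to filter on Unicode character names. This Python 3 sketch is a different technique from the script properties above: it assumes that a name prefix of "DEVANAGARI" or "CJK UNIFIED IDEOGRAPH" is a good enough stand-in for the Devanagari and Han scripts, which is not exactly the same set of characters. The helper name `keep_by_name` is hypothetical.

```python
import unicodedata

# Assumed name prefixes standing in for the Devanagari and Han scripts.
WANTED_PREFIXES = ("DEVANAGARI", "CJK UNIFIED IDEOGRAPH")


def keep_by_name(text, prefixes=WANTED_PREFIXES):
    """Keep only characters whose Unicode name starts with one of the prefixes."""
    return "".join(
        ch for ch in text
        # name() falls back to "" for characters without a name (e.g. control codes)
        if unicodedata.name(ch, "").startswith(prefixes)
    )


print(repr(keep_by_name("abc123\u0900")))
```

This is slower than a compiled regular expression and misses characters that belong to a script but are named differently, so it is best treated as a fallback.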