
Extract Only Unicode Characters from a String using Regular Expressions

I want to extract Unicode characters from a string using regular expressions, removing ASCII letters, numbers and special symbols from a string or a text file. Is this possible with a regular expression? For instance, I want only the Hindi or Chinese characters from text taken from a news article.

ASCII is a subset of Unicode, so the question doesn't quite make sense as-is. If you really want to remove all codepoints below U+0080 from the string, that's easy:

re.sub(r"[\x00-\x7f]+", "", mystring)
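To make the effect concrete, here is a minimal Python 3 sketch that applies the same pattern to a mixed sample. The helper name `strip_ascii` and the sample text are illustrative assumptions, not part of the original answer. Note that spaces and punctuation are ASCII too, so they disappear along with the letters and digits.

```python
import re

def strip_ascii(text):
    # Remove every run of code points in the ASCII range U+0000..U+007F.
    return re.sub(r"[\x00-\x7f]+", "", text)

# Hypothetical sample mixing ASCII, Devanagari and Chinese text.
sample = "Hindi: हिन्दी, Chinese: 中文, 123!"
stripped = strip_ascii(sample)
print(stripped)
```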

If you want to keep only certain "whitelisted" characters, you need to specify precisely which codepoints to keep.

For example, to keep Devanagari codepoints (used for writing Hindi), you can use

re.sub(r"[^\u0900-\u097F]+", "", mystring)

or (Python 2, thanks @bobince for the heads-up!)

re.sub(ur"[^\u0900-\u097F]+", "", mystring)
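Since the question also mentions Chinese, the whitelist can hold more than one range. The following Python 3 sketch assumes the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is enough; the extension blocks are not covered, and the names `KEEP` and `keep_whitelisted` are made up for illustration.

```python
import re

# Assumed whitelist: the Devanagari block U+0900..U+097F plus the basic
# CJK Unified Ideographs block U+4E00..U+9FFF (extension blocks not included).
KEEP = re.compile(r"[^\u0900-\u097F\u4E00-\u9FFF]+")

def keep_whitelisted(text):
    # Delete every run of characters that falls outside the whitelist.
    return KEEP.sub("", text)

print(keep_whitelisted("News 2024: भारत and 中国!"))
```

This also deletes the whitespace between words. If word boundaries matter, adding `\s` inside the character class keeps whitespace as well.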

You do need to make sure that you're working on a Unicode string, so don't forget to decode your input and encode your output:

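import re
import urllib2  # Python 2 only

url = 'http://www.bhaskar.com/'
# Decode the raw bytes to a unicode string (utf-8-sig also drops a leading BOM)
data = urllib2.urlopen(url).read().decode("utf-8-sig")
regex = re.compile(ur"[^\u0900-\u097F]+")
# Each non-Devanagari run becomes "foo" here; pass u"" instead to delete it
hindionly = regex.sub("foo", data)
print hindionly.encode("utf-8")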
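For Python 3, the same decode, filter, encode round trip might look like the sketch below. It works on an in-memory byte string instead of a live download so that it runs offline; `hindi_only` and the sample bytes are assumptions made for illustration, and non-Devanagari runs are deleted rather than replaced with "foo".

```python
import re

NON_DEVANAGARI = re.compile(r"[^\u0900-\u097F]+")

def hindi_only(raw_bytes, encoding="utf-8-sig"):
    # Decode bytes to str first; the regex operates on code points, not bytes.
    text = raw_bytes.decode(encoding)
    return NON_DEVANAGARI.sub("", text)

# Stand-in for downloaded page bytes, including a leading byte order mark.
raw = "\ufeffTitle: समाचार 42".encode("utf-8")
result = hindi_only(raw)

# Encode again only at the output boundary, e.g. when writing to a file.
encoded = result.encode("utf-8")
print(result)
```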

Using the third-party regex module, you could express the pattern using Unicode scripts:

# Python 2 syntax; in Python 3 drop the u/ur prefixes and use r'...' for the pattern
import regex
print(repr(regex.sub(ur'[^\p{Devanagari}\p{Han}]', u'', u'abc123\u0900')))
# u'\u0900'
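If installing the regex module is not an option, a rough standard-library approximation is to filter on Unicode character names. This is a different technique from script properties, not an equivalent: for example, U+0964 DEVANAGARI DANDA is named "DEVANAGARI" but belongs to the Common script. The helper `keep_scripts` and its default prefixes are assumptions made for this Python 3 sketch.

```python
import unicodedata

def keep_scripts(text, prefixes=("DEVANAGARI", "CJK UNIFIED IDEOGRAPH")):
    # unicodedata.name() returns the character's official name; the second
    # argument is the fallback used for code points that have no name.
    return "".join(
        ch for ch in text
        if unicodedata.name(ch, "").startswith(prefixes)
    )

print(ascii(keep_scripts("abc123\u0915\u4e2d")))
```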


