Python, remove all non-alphabet chars from string
I am writing a Python MapReduce word count program. The problem is that there are many non-alphabet chars strewn about in the data. I have found this post, "Stripping everything but alphanumeric chars from a string in Python", which shows a nice solution using regex, but I am not sure how to implement it.
def mapfn(k, v):
    print v
    import re, string
    pattern = re.compile('[\W_]+')
    v = pattern.match(v)
    print v
    for w in v.split():
        yield w, 1
I'm afraid I am not sure how to use the library re, or even regex for that matter. I am not sure how to apply the regex pattern to the incoming string (line of a book) v properly to retrieve the new line without any non-alphanumeric chars.
Suggestions?
import re
regex = re.compile('[^a-zA-Z]')
#First parameter is the replacement, second parameter is your input string
regex.sub('', 'ab3d*E')
#Out: 'abdE'
Alternatively, if you only want to remove a certain set of characters (as an apostrophe might be okay in your input...)
regex = re.compile('[,\.!?]') #etc.
If you would rather not use a regex, you could try
''.join([i for i in s if i.isalpha()])
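Applied to the mapper from the question, the substitution approach might look like the sketch below. It assumes Python 3 and that v is one line of text; the name NON_LETTERS is hypothetical. Runs of non-letters are replaced with a space rather than an empty string, so that adjacent words are not glued together before splitting.

```python
import re

# Hypothetical name: matches runs of anything that is not an ASCII letter.
NON_LETTERS = re.compile(r'[^a-zA-Z]+')

def mapfn(k, v):
    # Replace each run of non-letters with one space so words stay separated,
    # then split on whitespace and emit (word, 1) pairs.
    cleaned = NON_LETTERS.sub(' ', v)
    for w in cleaned.split():
        yield w, 1

print(list(mapfn(0, "It's 3 o'clock, said Alice!")))
# [('It', 1), ('s', 1), ('o', 1), ('clock', 1), ('said', 1), ('Alice', 1)]
```

Note that apostrophes split words here ("It's" becomes "It" and "s"); if that matters for the count, a narrower character class such as the one above is probably the better fit.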
You can use the re.sub() function to remove these characters:
>>> import re
>>> re.sub("[^a-zA-Z]+", "", "ABC12abc345def")
'ABCabcdef'
re.sub(MATCH PATTERN, REPLACE STRING, STRING TO SEARCH)
"[^a-zA-Z]+" - look for any group of characters that are NOT a-zA-Z.
"" - replace the matched characters with ""
Try:
s = ''.join(filter(str.isalnum, s))
This will take every char from the string, keep only the alphanumeric ones and build a string back from them.
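One detail worth checking before using this for a letters-only count: in Python 3, str.isalnum and str.isalpha are Unicode-aware, so digits (for isalnum) and accented letters are kept. A small sketch of the difference, with a made-up sample string and hypothetical variable names (str.isascii requires Python 3.7 or later):

```python
s = "naïve café 42!"

# Keeps letters and digits, including non-ASCII letters.
alnum_only = ''.join(filter(str.isalnum, s))

# Keeps letters only, still including non-ASCII letters.
alpha_only = ''.join(filter(str.isalpha, s))

# Keeps only ASCII letters, closest to the regex [^a-zA-Z].
ascii_alpha_only = ''.join(c for c in s if c.isascii() and c.isalpha())

print(alnum_only)        # naïvecafé42
print(alpha_only)        # naïvecafé
print(ascii_alpha_only)  # navecaf
```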
The fastest method is regex
import timeit

#Try with regex first
t0 = timeit.timeit("""
s = r2.sub('', st)
""", setup = """
import re
r2 = re.compile(r'[^a-zA-Z0-9]', re.MULTILINE)
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)
#Try with join method on filter
t0 = timeit.timeit("""
s = ''.join(filter(str.isalnum, st))
""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""",
number = 1000000)
print(t0)
#Try with only join
t0 = timeit.timeit("""
s = ''.join(c for c in st if c.isalnum())
""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)
2.6002226710006653 Method 1 Regex
5.739747313000407 Method 2 Filter + Join
6.540099570000166 Method 3 Join
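Timings like these depend on the machine, the Python version and the input, so it may be worth re-running the comparison locally. The sketch below restates the three methods as functions (the function names are hypothetical), checks that they agree on the sample string, and reports the best of a few repeats; the iteration counts are arbitrary and kept small.

```python
import re
import timeit

ST = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
R2 = re.compile(r'[^a-zA-Z0-9]')

def with_regex(st=ST):
    return R2.sub('', st)

def with_filter(st=ST):
    return ''.join(filter(str.isalnum, st))

def with_join(st=ST):
    return ''.join(c for c in st if c.isalnum())

# Sanity check: all three methods should produce the same string here.
expected = with_regex()
assert with_filter() == expected and with_join() == expected

for fn in (with_regex, with_filter, with_join):
    # Best of 3 runs of 10,000 calls each; absolute numbers will vary.
    best = min(timeit.repeat(fn, number=10000, repeat=3))
    print(fn.__name__, best)
```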
It is advisable to use the PyPi regex module if you plan to match specific Unicode property classes. This library has also proven to be more stable, especially when handling large texts, and yields consistent results across various Python versions. All you need to do is keep it up to date.
If you install it (using pip install regex or pip3 install regex), you may use
import regex
print ( regex.sub(r'\P{L}+', '', 'ABCŁąć1-2!Абв3§4“5def”') )
# => ABCŁąćАбвdef
to remove all chunks of 1 or more characters other than Unicode letters from text. You may also use
"".join(regex.findall(r'\p{L}+', 'ABCŁąć1-2!Абв3§4“5def”'))
to get the same result.
In Python re, in order to match any Unicode letter, one may use the [^\W\d_] construct (see "Match any unicode letter?").
So, to remove all non-letter characters, you may either match all letters and join the results:
result = "".join(re.findall(r'[^\W\d_]', text))
Or, remove all chars other than those matched with [^\W\d_]:
result = re.sub(r'([^\W\d_])|.', r'\1', text, flags=re.DOTALL)  # flags must be passed by keyword; the 4th positional argument is count
However, you may get inconsistent results across various Python versions because the Unicode standard is evolving, and the set of chars matched with \w will depend on the Python version. Using the PyPi regex library is highly recommended to get consistent results.
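For readers who want to stay with the standard library, here is a minimal sketch of the [^\W\d_] approach using only re, assuming Python 3 (where str patterns are Unicode-aware by default). The variable names are hypothetical; the second variant removes runs of the complementary class [\W\d_] instead of collecting matches.

```python
import re

text = 'ABCŁąć1-2!Абв3§4“5def”'

# Collect every Unicode letter: a word char that is neither a digit nor "_".
joined = ''.join(re.findall(r'[^\W\d_]', text))

# Equivalent idea: delete runs of non-word chars, digits and underscores.
substituted = re.sub(r'[\W\d_]+', '', text)

print(joined)       # ABCŁąćАбвdef
print(substituted)  # ABCŁąćАбвdef
```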
Here's yet another callable function that removes everything that is not plain English:
import re
remove_non_english = lambda s: re.sub(r'[^a-zA-Z\s\n\.]', ' ', s)
Usage:
remove_non_english('a€bñcá`` something. 2323')
> 'a b c    something.     '
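Because each removed character becomes its own space, the result can contain long runs of whitespace. If that is unwanted, one possible variant (a sketch, not part of the answer above; it reuses the same function name for comparison) collapses the runs afterwards:

```python
import re

def remove_non_english(s):
    # Replace anything that is not an ASCII letter, whitespace or "." with a space.
    cleaned = re.sub(r'[^a-zA-Z\s.]', ' ', s)
    # Collapse runs of whitespace and trim the ends.
    return ' '.join(cleaned.split())

print(remove_non_english('a€bñcá`` something. 2323'))  # a b c something.
```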