Python, remove all non-alphabet chars from string

Question

I am writing a Python MapReduce word count program. The problem is that there are many non-alphabet chars strewn about in the data. I found the post "Stripping everything but alphanumeric chars from a string in Python", which shows a nice solution using regex, but I am not sure how to implement it.

def mapfn(k, v):
    print v
    import re, string 
    pattern = re.compile('[\W_]+')
    v = pattern.match(v)
    print v
    for w in v.split():
        yield w, 1

I'm afraid I am not sure how to use the re library, or even regex for that matter. I am not sure how to apply the regex pattern to the incoming string v (a line of a book) so that I get back the line without any non-alphanumeric chars.

Suggestions?

Answer 1

Use re.sub

import re

regex = re.compile('[^a-zA-Z]')
#First parameter is the replacement, second parameter is your input string
regex.sub('', 'ab3d*E')
#Out: 'abdE'

Alternatively, if you only want to remove a certain set of characters (as an apostrophe might be okay in your input...)

regex = re.compile(r'[,.!?]') #etc.
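Applied to the mapper from the question, this could look like the sketch below. It assumes whitespace should be kept so that split() still finds word boundaries (the plain [^a-zA-Z] pattern would also delete the spaces and glue the words together); the name NON_LETTERS is made up for illustration.

```python
import re

# Compiled once; matches runs of anything that is neither an ASCII letter nor whitespace.
NON_LETTERS = re.compile(r'[^a-zA-Z\s]+')

def mapfn(k, v):
    # sub() returns a new string; match() would only return a match object (or None).
    cleaned = NON_LETTERS.sub('', v)
    for w in cleaned.split():
        yield w, 1

pairs = list(mapfn(0, "It's 3 o'clock -- time, indeed!"))
print(pairs)
# [('Its', 1), ('oclock', 1), ('time', 1), ('indeed', 1)]
```

Note that apostrophes are dropped here too; add ' to the character class if contractions should stay intact.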

Answer 2

If you prefer not to use regex, you might try

''.join([i for i in s if i.isalpha()])
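One thing to be aware of: in Python 3, str.isalpha() is Unicode-aware, so accented and non-Latin letters are kept. A small sketch of the difference, assuming Python 3.7+ for str.isascii(); the sample string is invented:

```python
s = "naïve café, 42 times!"

# Keeps every Unicode letter, including ï and é.
letters = ''.join(c for c in s if c.isalpha())

# Keeps only ASCII letters (str.isascii needs Python 3.7+).
ascii_letters = ''.join(c for c in s if c.isascii() and c.isalpha())

print(letters)        # naïvecafétimes
print(ascii_letters)  # navecaftimes
```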

Answer 3

You can use the re.sub() function to remove these characters:

>>> import re
>>> re.sub("[^a-zA-Z]+", "", "ABC12abc345def")
'ABCabcdef'
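The + in the pattern makes each run of unwanted characters one match. With an empty replacement the result is the same either way, but it matters as soon as you replace with something else, for example a space to keep words apart. A short sketch:

```python
import re

s = "ABC12abc345def"

# With +, each run of non-letters becomes a single space.
collapsed = re.sub("[^a-zA-Z]+", " ", s)

# Without +, every removed character becomes its own space.
per_char = re.sub("[^a-zA-Z]", " ", s)

print(repr(collapsed))  # 'ABC abc def'
print(repr(per_char))   # 'ABC  abc   def'
```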

re.sub(MATCH PATTERN, REPLACE STRING, STRING TO SEARCH)

"[^a-zA-Z]+" - look for any group of characters that are NOT a-zA-Z.
"" - Replace the matched characters with ""

Answer 4

Try:

s = ''.join(filter(str.isalnum, s))

This takes every char from the string, keeps only the alphanumeric ones and builds a string back from them.
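Since the question asks for letters only, swapping in str.isalpha drops the digits as well. A quick comparison on an invented sample string:

```python
s = "ab3d*E f_1"

# isalnum keeps letters and digits; the underscore is not alphanumeric.
alnum_only = ''.join(filter(str.isalnum, s))

# isalpha keeps letters only.
alpha_only = ''.join(filter(str.isalpha, s))

print(alnum_only)  # ab3dEf1
print(alpha_only)  # abdEf
```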

Answer 5

In my timings, the fastest method was regex

import timeit

#Try with regex first
t0 = timeit.timeit("""
s = r2.sub('', st)

""", setup = """
import re
r2 = re.compile(r'[^a-zA-Z0-9]', re.MULTILINE)
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)

#Try with join method on filter
t0 = timeit.timeit("""
s = ''.join(filter(str.isalnum, st))

""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""",
number = 1000000)
print(t0)

#Try with only join
t0 = timeit.timeit("""
s = ''.join(c for c in st if c.isalnum())

""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)


2.6002226710006653 Method 1 Regex
5.739747313000407 Method 2 Filter + Join
6.540099570000166 Method 3 Join
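If you want to rerun the comparison on your own machine, here is a more compact harness as a sketch. It first checks that all three candidates return the same string, then times each one. The candidates dict and the lower repetition count are my own choices, and the lambda wrappers add a little call overhead to every measurement, so only the relative order is meaningful.

```python
import re
import timeit

st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
r2 = re.compile(r'[^a-zA-Z0-9]')

candidates = {
    'regex': lambda: r2.sub('', st),
    'filter + join': lambda: ''.join(filter(str.isalnum, st)),
    'generator + join': lambda: ''.join(c for c in st if c.isalnum()),
}

# The timings are only comparable if every candidate produces the same output.
results = {name: fn() for name, fn in candidates.items()}
assert len(set(results.values())) == 1

for name, fn in candidates.items():
    print(name, timeit.timeit(fn, number=10_000))
```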

Answer 6

It is advisable to use the PyPI regex module if you plan to match specific Unicode property classes. This library has also proven to be more stable, especially when handling large texts, and yields consistent results across Python versions. All you need to do is keep it up to date.

If you install it (using pip install regex or pip3 install regex), you may use

import regex
print(regex.sub(r'\P{L}+', '', 'ABCŁąć1-2!Абв3§4“5def”'))
# => ABCŁąćАбвdef

to remove all chunks of 1 or more characters other than Unicode letters from the text. You may also use "".join(regex.findall(r'\p{L}+', 'ABCŁąć1-2!Абв3§4“5def”')) to get the same result.

In Python re, to match any Unicode letter, one may use the [^\W\d_] construct (see "Match any unicode letter?").

So, to remove all non-letter characters, you may either match all letters and join the results:

result = "".join(re.findall(r'[^\W\d_]', text))

Or remove all chars other than those matched with [^\W\d_]:

result = re.sub(r'([^\W\d_])|.', r'\1', text, flags=re.DOTALL)

However, you may get inconsistent results across Python versions because the Unicode standard is evolving, and the set of chars matched with \w depends on the Python version. Using the PyPI regex library is highly recommended to get consistent results.
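If you stay with the standard re module, the complement of [^\W\d_] can also be written directly as [\W\d_], which avoids the capture-group trick. A small sketch (assuming Python 3, where str patterns are Unicode-aware by default) that checks both routes agree on the sample string used earlier:

```python
import re

text = 'ABCŁąć1-2!Абв3§4“5def”'

# Route 1: collect every Unicode letter and join them.
via_findall = "".join(re.findall(r'[^\W\d_]', text))

# Route 2: delete runs of non-word chars, digits and underscores.
via_sub = re.sub(r'[\W\d_]+', '', text)

print(via_sub)  # ABCŁąćАбвdef
assert via_findall == via_sub
```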

Answer 7

Here's yet another callable function that removes everything that is not plain English:

import re
remove_non_english = lambda s: re.sub(r'[^a-zA-Z\s\n\.]', ' ', s)

Usage:

remove_non_english('a€bñcá`` something. 2323')
> 'a b c    something.     '

7 answers

Answer 1 141 accepted 2014-03-20 00:36:04
Answer 2 55 2015-03-30 15:54:47
Answer 3 38 2014-03-20 00:43:31
Answer 4 25 2015-01-05 05:16:09
Answer 5 6 2018-04-22 11:49:13
Answer 6 0 2020-06-01 13:28:09
Answer 7 0 2022-12-14 00:45:25
