Python, remove all non-alphabet chars from string

I am writing a Python MapReduce word count program. The problem is that many non-alphabet chars are strewn about in the data. I found the post "Stripping everything but alphanumeric chars from a string in Python", which shows a nice solution using regex, but I am not sure how to implement it.

def mapfn(k, v):
    print v
    import re, string 
    pattern = re.compile('[\W_]+')
    v = pattern.match(v)
    print v
    for w in v.split():
        yield w, 1

I'm afraid I am not sure how to use the re library, or even regex for that matter. I am not sure how to apply the regex pattern to the incoming string v (a line of a book) so that I get back the line without any non-alphanumeric chars.

Suggestions?

Use re.sub

import re

regex = re.compile('[^a-zA-Z]')
#First parameter is the replacement, second parameter is your input string
regex.sub('', 'ab3d*E')
#Out: 'abdE'

Alternatively, if you only want to remove a certain set of characters (as an apostrophe might be okay in your input...)

regex = re.compile(r'[,.!?]') #etc.

If you don't want to use a regex, you can try

''.join([i for i in s if i.isalpha()])

You can use the re.sub() function to remove these characters:

>>> import re
>>> re.sub("[^a-zA-Z]+", "", "ABC12abc345def")
'ABCabcdef'

re.sub(MATCH PATTERN, REPLACE STRING, STRING TO SEARCH)

  • "[^a-zA-Z]+" - look for any group of characters that are NOT a-zA-Z.
  • "" - Replace the matched characters with ""
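Applied to the mapper from the question, this might look like the sketch below. Note that pattern.match() only tests for a match at the start of the string and returns a match object (or None), not a cleaned string, which is why the original attempt fails at v.split(). The name NON_LETTERS is made up for this example, and replacing with a space instead of "" is an assumption made so that neighbouring words are not glued together.

```python
import re

# Compiled once; matches runs of anything that is not an ASCII letter.
# NON_LETTERS is an illustrative name, not part of the original post.
NON_LETTERS = re.compile(r'[^a-zA-Z]+')

def mapfn(k, v):
    # Replace each run of non-letters with one space (assumption: words
    # should stay separated), then split the cleaned line on whitespace.
    cleaned = NON_LETTERS.sub(' ', v)
    for w in cleaned.split():
        yield w, 1

pairs = list(mapfn(0, "Hello, world! It's 2024."))
print(pairs)
```

With this choice "It's" becomes the two tokens "It" and "s"; if apostrophes should survive, add one to the character class.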

Try:

s = ''.join(filter(str.isalnum, s))

This will take every char from the string, keep only the alphanumeric ones and build a string back from them.
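One thing to keep in mind is that str.isalnum is Unicode-aware, so accented and non-Latin letters are kept as well. The sketch below contrasts that with an ASCII-only variant; the sample string and variable names are made up for illustration, and str.isascii requires Python 3.7 or later.

```python
s = "añb1-2 Łódź_3"

# str.isalnum is Unicode-aware, so accented letters survive the filter.
unicode_alnum = ''.join(filter(str.isalnum, s))

# Also requiring str.isascii (Python 3.7+) keeps only ASCII letters and digits.
ascii_alnum = ''.join(c for c in s if c.isascii() and c.isalnum())

print(unicode_alnum)
print(ascii_alnum)
```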

The fastest method is regex (the timings below are total seconds for 1,000,000 runs)

import timeit

#Try with regex first
t0 = timeit.timeit("""
s = r2.sub('', st)

""", setup = """
import re
r2 = re.compile(r'[^a-zA-Z0-9]', re.MULTILINE)
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)

#Try with join method on filter
t0 = timeit.timeit("""
s = ''.join(filter(str.isalnum, st))

""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""",
number = 1000000)
print(t0)

#Try with only join
t0 = timeit.timeit("""
s = ''.join(c for c in st if c.isalnum())

""", setup = """
st = 'abcdefghijklmnopqrstuvwxyz123456789!@#$%^&*()-=_+'
""", number = 1000000)
print(t0)


2.6002226710006653 Method 1 Regex
5.739747313000407 Method 2 Filter + Join
6.540099570000166 Method 3 Join

It is advisable to use the PyPI regex module if you plan to match specific Unicode property classes. This library has also proven to be more stable, especially when handling large texts, and yields consistent results across various Python versions. All you need to do is keep it up to date.

If you install it (using pip install regex or pip3 install regex), you may use

import regex
print ( regex.sub(r'\P{L}+', '', 'ABCŁąć1-2!Абв3§4“5def”') )
# => ABCŁąćАбвdef

to remove all chunks of 1 or more characters other than Unicode letters from text. You may also use "".join(regex.findall(r'\p{L}+', 'ABCŁąć1-2!Абв3§4“5def”')) to get the same result.

In Python re, in order to match any Unicode letter, one may use the [^\W\d_] construct (see "Match any unicode letter?").

So, to remove all non-letter characters, you may either match all letters and join the results:

result = "".join(re.findall(r'[^\W\d_]', text))

Or, remove all chars other than those matched with [^\W\d_]:

result = re.sub(r'([^\W\d_])|.', r'\1', text, flags=re.DOTALL)  # flags must be a keyword argument; the 4th positional argument is count

However, you may get inconsistent results across various Python versions because the Unicode standard is evolving, and the set of chars matched with \w will depend on the Python version. Using the PyPI regex library is highly recommended to get consistent results.
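As a minimal sketch of the standard-library approach, the snippet below runs both variants on the sample string used earlier in this answer. The names LETTER, kept and stripped are made up for illustration, and the second variant uses the simpler negated class [\W\d_]+ instead of the capture-group pattern, which is a swapped-in alternative and not the original code.

```python
import re

text = 'ABCŁąć1-2!Абв3§4“5def”'

# A "letter" here is a word character that is neither a digit nor "_".
LETTER = r'[^\W\d_]'

# Variant 1: collect every letter and join them back together.
kept = ''.join(re.findall(LETTER, text))

# Variant 2: delete every run of non-word chars, digits and underscores.
stripped = re.sub(r'[\W\d_]+', '', text)

print(kept)
print(stripped)
```

The signature is re.sub(pattern, repl, string, count=0, flags=0), so any flags should be passed by keyword to avoid being interpreted as count.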

Here's yet another callable that replaces every character that is not a plain English letter, whitespace, or a period with a space:

import re
remove_non_english = lambda s: re.sub(r'[^a-zA-Z\s\n\.]', ' ', s)

Usage:

remove_non_english('a€bñcá`` something. 2323')
> 'a b c    something.     '
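Because each removed character becomes a space, the result can contain long runs of blanks. If that is unwanted, one option is to collapse the whitespace afterwards, as in this sketch; squeeze is a hypothetical helper name, and the function is a def-based restatement of the lambda above with an equivalent character class.

```python
import re

def remove_non_english(s):
    # Keep ASCII letters, whitespace and periods; everything else becomes a space.
    return re.sub(r'[^a-zA-Z\s.]', ' ', s)

def squeeze(s):
    # Hypothetical helper: collapse whitespace runs and trim both ends.
    return ' '.join(s.split())

cleaned = squeeze(remove_non_english('a€bñcá`` something. 2323'))
print(cleaned)
```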
