Remove all special characters, punctuation and spaces from string

I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.

This can be done without regex:

>>> string = "Special $#! characters   spaces 888323"
>>> ''.join(e for e in string if e.isalnum())
'Specialcharactersspaces888323'

You can use str.isalnum:

 S.isalnum() -> bool Return True if all characters in S are alphanumeric and there is at least one character in S, False otherwise.

If you insist on using regex, other solutions will do fine. However, note that if it can be done without a regular expression, that's the best way to go about it.
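As a hedged illustration of the str.isalnum approach: in Python 3, isalnum() is also true for non-ASCII letters and digits, so the sketch below adds an optional ASCII-only mode. The helper name keep_alnum and its ascii_only parameter are hypothetical, and str.isascii() assumes Python 3.7 or later.

```python
def keep_alnum(text, ascii_only=False):
    """Return text with every non-alphanumeric character removed."""
    if ascii_only:
        # Additionally drop letters/digits outside the ASCII range
        return "".join(ch for ch in text if ch.isascii() and ch.isalnum())
    return "".join(ch for ch in text if ch.isalnum())


print(keep_alnum("Special $#! characters   spaces 888323"))  # Specialcharactersspaces888323
print(keep_alnum("smørrebrød!"))                             # smørrebrød
print(keep_alnum("smørrebrød!", ascii_only=True))            # smrrebrd
```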

Here is a regex to match a string of characters that are not letters or numbers:

[^A-Za-z0-9]+

Here is the Python command to do a regex substitution:

re.sub('[^A-Za-z0-9]+', '', mystring)

Shorter way:

import re
cleanString = re.sub(r'\W+', '', string)

If you want spaces between words and numbers, substitute '' with ' '.
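One detail worth checking with the \W shortcut: \W does not match the underscore, so underscores survive the substitution. The sketch below (sample text and variable names are made up for illustration) compares \W+ with [\W_]+ and shows the space-preserving variant.

```python
import re

text = "snake_case & more: 42!"

# \W matches anything that is not a letter, digit or underscore
no_specials = re.sub(r"\W+", "", text)
print(no_specials)       # snake_casemore42

# Adding "_" to the class removes underscores as well
no_underscores = re.sub(r"[\W_]+", "", text)
print(no_underscores)    # snakecasemore42

# Replacing each run with a single space keeps the words separated
spaced = re.sub(r"[\W_]+", " ", text).strip()
print(spaced)            # snake case more 42
```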

TLDR

I timed the provided answers.

import re
re.sub(r'\W+', '', string)

is typically 3x faster than the next fastest provided top answer.

Caution should be taken when using this option. Some special characters (e.g. ø) may not be stripped using this method.
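To make the caution concrete: in Python 3, \W is Unicode-aware for str patterns, so letters such as ø count as word characters and are kept. If only ASCII letters and digits should remain, the re.ASCII flag is one option. A small sketch with an invented sample string:

```python
import re

text = "smørrebrød, 2 stk."

# Default (Unicode) matching: ø is a word character, so it is kept
unicode_result = re.sub(r"\W+", "", text)
print(unicode_result)   # smørrebrød2stk

# With re.ASCII, \W also matches non-ASCII letters, so ø is removed
ascii_result = re.sub(r"\W+", "", text, flags=re.ASCII)
print(ascii_result)     # smrrebrd2stk
```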


After seeing this, I was interested in expanding on the provided answers by finding out which executes in the least amount of time, so I went through and checked some of the proposed answers with timeit against two of the example strings:

  • string1 = 'Special $#! characters spaces 888323'
  • string2 = 'how much for the maple syrup? $20.99? That s ridiculous!!!'

Example 1

''.join(e for e in string if e.isalnum())
  • string1 - Result: 10.7061979771
  • string2 - Result: 7.78372597694

Example 2

import re
re.sub(r'[^A-Za-z0-9]+', '', string)
  • string1 - Result: 7.10785102844
  • string2 - Result: 4.12814903259

Example 3

import re
re.sub(r'\W+', '', string)
  • string1 - Result: 3.11899876595
  • string2 - Result: 2.78014397621

Each result above is the lowest time returned by repeat(3, 2000000).

Example 3 can be 3x faster than Example 1.
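A comparison along these lines could be reproduced roughly as follows. This is only a sketch: the dictionary keys and the best_time helper are hypothetical names, the iteration count is reduced from 2000000 to keep the run short, and absolute numbers will differ by machine and Python version.

```python
import re
import timeit

string1 = 'Special $#! characters spaces 888323'

# The three approaches timed above, wrapped as callables
candidates = {
    "isalnum": lambda s: ''.join(e for e in s if e.isalnum()),
    "char_class": lambda s: re.sub(r'[^A-Za-z0-9]+', '', s),
    "non_word": lambda s: re.sub(r'\W+', '', s),
}


def best_time(func, text, repeat=3, number=20000):
    """Return the lowest total time over `repeat` runs of `number` calls."""
    return min(timeit.repeat(lambda: func(text), repeat=repeat, number=number))


timings = {name: best_time(func, string1) for name, func in candidates.items()}
for name, seconds in sorted(timings.items(), key=lambda item: item[1]):
    print(f"{name}: {seconds:.4f}s")
```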

Python 2.*

I think just filter(str.isalnum, string) works

In [20]: filter(str.isalnum, 'string with special chars like !,#$% etcs.')
Out[20]: 'stringwithspecialcharslikeetcs'

Python 3.*

In Python 3, the filter() function returns an iterable object (instead of a string, unlike above). One has to join it back to get a string from the iterable:

''.join(filter(str.isalnum, string))

or pass a list to join (not sure, but it may be a bit faster):

''.join([*filter(str.isalnum, string)])

Note: unpacking in [*args] is valid from Python >= 3.5.

#!/usr/bin/python
import re

strs = "how much for the maple syrup? $20.99? That's ridiculous!!!"
print(strs)
# Inside [...] characters are listed directly; '|' is not needed as a separator
nstr = re.sub(r'[?$.!]', '', strs)
print(nstr)
# Drop anything else that is not a letter, digit or space
nestr = re.sub(r'[^a-zA-Z0-9 ]', '', nstr)
print(nestr)

You can add more special characters to the class; they will be replaced by '' (nothing), i.e. they will be removed.

Unlike everyone else who used regex, I would try to exclude every character that is not what I want, instead of explicitly enumerating what I don't want.

For example, if I want only characters from 'a to z' (upper and lower case) and numbers, I would exclude everything else:

import re
s = re.sub(r"[^a-zA-Z0-9]","",s)

This means "substitute every character that is not a number, or a character in the range 'a to z' or 'A to Z', with an empty string".

In fact, if you insert the special character ^ at the first position of a character class, you get its negation.
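The position of ^ matters: it negates the class only when it comes first inside the brackets; anywhere else it is an ordinary character. A short sketch with a made-up sample string:

```python
import re

sample = "a^b-c 1+2"

# ^ in first position: remove everything that is NOT a-z or 0-9
negated = re.sub(r"[^a-z0-9]", "", sample)
print(negated)   # abc12

# ^ elsewhere in the class: it is a literal caret, so letters and "^" are removed
literal = re.sub(r"[a-z^]", "", sample)
print(literal)   # - 1+2
```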

Extra tip: if you also need to lowercase the result, you can make the regex even faster and simpler, since no uppercase characters remain once the string is lowercased.

import re
s = re.sub(r"[^a-z0-9]","",s.lower())

string.punctuation contains the following characters:

'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

You can use the translate and maketrans functions to map punctuation to empty values (i.e. remove it):

import string

'This, is. A test!'.translate(str.maketrans('', '', string.punctuation))

Output:

'This is A test'

s = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", s)
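Rather than hand-listing punctuation in a character class as in the line above, one option is to build the class from string.punctuation with re.escape, which takes care of characters such as ], \, ^ and -. The names punct_pattern and strip_punctuation below are hypothetical, and like the translate approach this only covers ASCII punctuation.

```python
import re
import string

# Escape every punctuation character so it is safe inside [...]
punct_pattern = re.compile("[" + re.escape(string.punctuation) + "]")


def strip_punctuation(text):
    """Remove all ASCII punctuation characters from text."""
    return punct_pattern.sub("", text)


print(strip_punctuation("This, is. A test!"))  # This is A test
```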

Assuming you want to use a regex and you want/need Unicode-cognisant 2.x code that is 2to3-ready:

>>> import re
>>> rx = re.compile(u'[\W_]+', re.UNICODE)
>>> data = u''.join(unichr(i) for i in range(256))
>>> rx.sub(u'', data)
u'0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb2 [snip] \xfe\xff'
>>>

The most generic approach is to use the 'categories' of the unicodedata table, which classifies every single character. E.g. the following code keeps only printable characters based on their category:

import unicodedata
# strip unwanted characters (based on the Unicode database
# categorization):
# http://www.sql-und-xml.de/unicode-database/#kategorien

# Categories to keep: upper/lowercase letters, decimal digits, space separators
PRINTABLE = set(('Lu', 'Ll', 'Nd', 'Zs'))

def filter_non_printable(s):
    result = []
    for c in s:
        # Keep characters in an allowed category; replace the rest with a space
        result.append(c if unicodedata.category(c) in PRINTABLE else u' ')
    return u''.join(result)

Look at the URL given above for all related categories. You can of course also filter by the punctuation categories.
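Filtering by punctuation categories could look like the sketch below. It relies on the fact that Unicode general categories for punctuation start with "P", symbols with "S" and separators with "Z"; the function name strip_by_category and its drop_prefixes parameter are hypothetical.

```python
import unicodedata


def strip_by_category(text, drop_prefixes=("P", "S", "Z")):
    """Drop characters whose Unicode general category starts with a given letter.

    P* = punctuation, S* = symbols (e.g. currency signs), Z* = separators (e.g. spaces).
    """
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] not in drop_prefixes
    )


print(strip_by_category("Grüße, Wörld! 42 €"))                    # GrüßeWörld42
print(strip_by_category("Grüße, Wörld!", drop_prefixes=("P",)))   # Grüße Wörld
```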

For other languages like German, Spanish, Danish, French etc. that contain special characters (like the German "Umlaute" ü, ä, ö), simply add these to the regex search string:

Example for German:

re.sub('[^A-ZÜÖÄa-züöäß0-9]+', '', mystring)

This will remove all special characters, punctuation, and spaces from a string, leaving only numbers and letters.

import re

sample_str = "Hel&&lo %% Wo$#rl@d"

# using isalnum()
print("".join(k for k in sample_str if k.isalnum()))


# using regex (keep letters and digits)
op2 = re.sub("[^A-Za-z0-9]", "", sample_str)
print("op2 =", op2)


special_char_list = ["$", "@", "#", "&", "%"]

# using list comprehension (removes only the listed characters, so spaces stay)
op1 = "".join([k for k in sample_str if k not in special_char_list])
print("op1 =", op1)


# using lambda function
op3 = "".join(filter(lambda x: x not in special_char_list, sample_str))
print("op3 =", op3)

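Use translate:

import string

def clean(instr):
    # Python 3 form; the Python 2 original was instr.translate(None, string.punctuation + ' ')
    return instr.translate(str.maketrans('', '', string.punctuation + ' '))

Caveat: this only removes ASCII punctuation and the space character (and the Python 2 form only worked on ASCII byte strings).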

I need to remove all special characters, punctuation and spaces from a string so that I only have letters and numbers.

This will remove all non-alphanumeric characters except spaces.

string = "Special $#! characters   spaces 888323"
''.join(e for e in string if (e.isalnum() or e.isspace()))

'Special  characters   spaces 888323'
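Because the removed characters leave their surrounding spaces behind, the result can contain runs of whitespace. If single spaces are wanted, one possible follow-up step (a sketch; the variable names kept and normalized are made up) is to split on whitespace and rejoin:

```python
string = "Special $#! characters   spaces 888323"

# Keep letters, digits and whitespace, as in the snippet above
kept = ''.join(e for e in string if e.isalnum() or e.isspace())

# split() with no arguments splits on any whitespace run and drops empty pieces
normalized = ' '.join(kept.split())
print(normalized)  # Special characters spaces 888323
```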

import re
my_string = """Strings are amongst the most popular data types in Python. We can create the strings by enclosing characters in quotes. Python treats single quotes the same as double quotes."""

# if we need to count the word python that ends with or without ',' or '.' at end

# Assumption: the original snippet never defines `text`; a lowercased word list is the likely intent
text = my_string.lower().split()
for count, word in enumerate(text):
    if word.endswith((".", ",")):
        # drop the trailing '.' or ',' from purely alphabetic words
        text[count] = re.sub(r"^([a-z]+)[.,]?$", r"\1", word)
print("The count of Python : ", text.count("python"))

Ten years on, I think the solution below is the best one. You can remove/clean all special characters, punctuation, ASCII characters and spaces from the string.

from cleantext import clean  # pip install clean-text

string = 'Special $#! characters   spaces 888323'
new = clean(string, lower=False, no_currency_symbols=True, no_punct=True, replace_with_currency_symbol='')
print(new)
# Output ==> 'Special characters spaces 888323'

# you can remove the spaces too if you want
update = new.replace(' ', '')
print(update)
# Output ==> 'Specialcharactersspaces888323'

function regexFuntion(st) {
  const regx = /[^\w\s]/gi; // allow : [a-zA-Z0-9, space]
  st = st.replace(regx, ''); // remove all data without [a-zA-Z0-9, space]
  st = st.replace(/\s\s+/g, ' '); // remove multiple space

  return st;
}

console.log(regexFuntion('$Hello; # -world--78asdf+-===asdflkj******lkjasdfj67;'));
// Output: Hello world78asdfasdflkjlkjasdfj67

abc = "askhnl#$%askdjalsdk"
ddd = abc.replace("#$%", "")
print(ddd)

and you shall see your result as

askhnlaskdjalsdk
