
How to replace all characters except letters, numbers, forward and back slashes

I want to parse text and keep only letters, digits, forward slashes and backslashes, replacing everything else with ''.

Is it possible to do this with a single regex pattern rather than several patterns applied in a loop? I have been unable to get the pattern below to leave the backslash and forward slash in place.

import re

line = "1/R~e`p!l@@a#c$e%% ^A&l*l( S)-p_e+c=ial C{har}act[er]s ;E  xce|pt Forw:ard\" $An>d B,?a..ck Sl'as<he#s\\2"
line2 = line
RGX_PATTERN = r"[^\w]", "_"

for pattern in RGX_PATTERN:
    line = re.sub(pattern, '', line)
print("replace1: " + line)
# Prints: replace1: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2

The code below, taken from SO, was tested and found to be faster than regex, but it removes all special characters, including the / and \\ that I want to preserve. Is there any way to adapt it to my use case while keeping its speed advantage over regex?

line2 = ''.join(e for e in line2 if e.isalnum())
print("replace2: " + line2)
#Prints: 1ReplaceAllSpecialCharactersExceptForwardAndBackSlashes2

As an extra hurdle, the text I am parsing should be ASCII, so if possible any non-ASCII characters should also be replaced with ''.

A fair bit faster and works for Unicode:

import re

# Raw string avoids the invalid "\/" escape; "_" is already outside the allowed class.
full_pattern = re.compile(r'[^a-zA-Z0-9\\/]')

def re_replace(string):
    return full_pattern.sub('', string)
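As a quick check, the pattern can be run against the sample line from the question. This is only a minimal sketch: the raw-string spelling of the pattern and the `sample` name are choices made here for illustration.

```python
import re

# Negated character class: anything that is NOT an ASCII letter, digit,
# backslash or forward slash is matched and then removed.
full_pattern = re.compile(r'[^a-zA-Z0-9\\/]')

def re_replace(string):
    return full_pattern.sub('', string)

sample = "1/R~e`p!l@@a#c$e%% ^A&l*l( S)-p_e+c=ial C{har}act[er]s ;E  xce|pt Forw:ard\" $An>d B,?a..ck Sl'as<he#s\\2"
print(re_replace(sample))
# 1/ReplaceAllSpecialCharactersExceptForwardAndBackSlashes\2
```

Because the class lists only ASCII ranges, non-ASCII characters are removed by the same pattern, which covers the ASCII requirement in the question.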

If you want it really fast, this is by far the best (but slightly obscure) method:

def wanted(character):
    return character.isalnum() or character in '\\/'

ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]

def fast_replace(string):
    # Remove all non-ASCII characters. Heavily optimised.
    string = string.encode('ascii', errors='ignore').decode('ascii')

    # Remove unwanted ASCII characters
    return string.translate(ascii_code_point_filter)
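To make the lookup-table step less obscure: `str.translate` looks up each character's code point in the table and deletes the character when the entry is `None`. The sketch below builds an equivalent table as a dict with `str.maketrans`, an alternative picked here for illustration (the code above uses a 128-entry list). The names `allowed`, `delete_table` and `fast_replace_dict` are hypothetical.

```python
import string

# Characters to keep: ASCII letters, digits, backslash and forward slash.
allowed = set(string.ascii_letters + string.digits + '\\/')

# Map every unwanted ASCII character to None, which means "delete".
unwanted = ''.join(chr(i) for i in range(128) if chr(i) not in allowed)
delete_table = str.maketrans('', '', unwanted)

def fast_replace_dict(text):
    # Drop non-ASCII characters first; a dict table leaves unknown
    # code points untouched, so this step is still required.
    text = text.encode('ascii', errors='ignore').decode('ascii')
    return text.translate(delete_table)

print(fast_replace_dict('a/b\\c d_e!\u00e99'))
# a/b\cde9
```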

Timings:

SETUP="
busy = ''.join(chr(i) for i in range(512))

import re
full_pattern = re.compile('[^a-zA-Z0-9\\\/]|_')

def in_whitelist(character):
    return character.isalnum() or character in '\\/'

def re_replace(string):
    return re.sub(full_pattern, '', string)

def wanted(character):
    return character.isalnum() or character in '\\/'

ascii_characters = [chr(ordinal) for ordinal in range(128)]
ascii_code_point_filter = [c if wanted(c) else None for c in ascii_characters]

def fast_replace(string):
    string = string.encode('ascii', errors='ignore').decode('ascii')
    return string.translate(ascii_code_point_filter)
"

python -m timeit -s "$SETUP" "re_replace(busy)"
python -m timeit -s "$SETUP" "''.join(e for e in busy if in_whitelist(e))"
python -m timeit -s "$SETUP" "fast_replace(busy)"

Results (in the same order as the commands above):

10000 loops, best of 3: 63 usec per loop
10000 loops, best of 3: 135 usec per loop
100000 loops, best of 3: 4.98 usec per loop
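The same comparison can also be run from a Python script, which sidesteps shell quoting of the backslashes. This is a rough sketch using the standard `timeit` module; the loop count is arbitrary, `join_replace` is a name introduced here for the generator-based approach, and absolute numbers will vary by machine and Python version.

```python
import re
import timeit

busy = ''.join(chr(i) for i in range(512))
full_pattern = re.compile(r'[^a-zA-Z0-9\\/]')

def wanted(character):
    return character.isalnum() or character in '\\/'

def re_replace(string):
    return full_pattern.sub('', string)

def join_replace(string):
    return ''.join(e for e in string if wanted(e))

ascii_code_point_filter = [c if wanted(c) else None for c in map(chr, range(128))]

def fast_replace(string):
    string = string.encode('ascii', errors='ignore').decode('ascii')
    return string.translate(ascii_code_point_filter)

loops = 1000
for func in (re_replace, join_replace, fast_replace):
    seconds = timeit.timeit(lambda: func(busy), number=loops)
    print(f"{func.__name__}: {seconds / loops * 1e6:.2f} usec per loop")
```

Note that `join_replace` keeps non-ASCII alphanumeric characters, so its output on `busy` is longer than that of the other two functions.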

Why can't you do something like:

def in_whitelist(character):
    return character.isalnum() or character in ['\\','/']

line2 = ''.join(e for e in line2 if in_whitelist(e))

Edited as per suggestion to condense the function.
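One caveat with this approach: `str.isalnum()` is Unicode-aware, so non-ASCII letters and digits such as 'é' pass the check, while the question asks for ASCII-only output. A possible variant, sketched here with an explicit set of allowed characters (the names `ALLOWED` and `keep_allowed` are made up for this example):

```python
import string

# Only ASCII letters, ASCII digits, backslash and forward slash survive.
ALLOWED = frozenset(string.ascii_letters + string.digits + '\\/')

def keep_allowed(text):
    return ''.join(ch for ch in text if ch in ALLOWED)

print(keep_allowed('caf\u00e9/menu_1\\x'))
# caf/menu1\x
```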

