Python regex to remove alphanumeric characters without removing words at the end of the string
I'm trying to clean some text by removing alphanumeric characters from the end of each string, but my regex also removes ordinary words, as shown in the output. Can someone help me achieve the expected result?
Current code:
re.sub(r'[a-zA-Z0-9/]{5,}$', '', text)
Input:
asus zenfone 3s max zc521tl
asus zenfone max plus (m1) zb570tl
asus zenfone max pro (m1) zb601kl/zb602k
nokia 3.1 c
nokia 3
asus zenfone 3 zoom ze553k
asus zenfone 3 deluxe zs570kl
blackberry keyone
htc explorer
lg tribute
acer liquid z520
Output:
asus zenfone 3s max
asus zenfone max plus (m1)
asus zenfone max pro (m1)
nokia 3.1 c
nokia 3
asus zenfone 3 zoom
asus zenfone 3 deluxe
blackberry
htc
lg
acer liquid z520
Expected output:
asus zenfone 3s max
asus zenfone max plus (m1)
asus zenfone max pro (m1)
nokia 3.1 c
nokia 3
asus zenfone 3 zoom
asus zenfone 3 deluxe
**blackberry keyone**
**htc explorer**
**lg tribute**
acer liquid z520
You can add a positive lookahead, (?=\D*\d), to the regex so that the word at the end is removed only if it contains at least one digit. That prevents the pattern from removing ordinary words that contain no digits.
The complete program:
#!/usr/bin/env python3
import re
texts = [
'asus zenfone 3s max zc521tl',
'asus zenfone max plus (m1) zb570tl',
'asus zenfone max pro (m1) zb601kl/zb602k',
'nokia 3.1 c',
'nokia 3',
'asus zenfone 3 zoom ze553k',
'asus zenfone 3 deluxe zs570kl',
'blackberry keyone',
'htc explorer',
'lg tribute',
'acer liquid z520',
]
for text in texts:
    # Remove a trailing token of 5+ characters only if it contains a digit.
    print(re.sub(r'(?=\D*\d)[a-zA-Z0-9/]{5,}$', '', text))
It outputs:
asus zenfone 3s max
asus zenfone max plus (m1)
asus zenfone max pro (m1)
nokia 3.1 c
nokia 3
asus zenfone 3 zoom
asus zenfone 3 deluxe
blackberry keyone
htc explorer
lg tribute
acer liquid z520
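One detail the printed output hides: the pattern consumes only the trailing token, so the space in front of it stays in the result. A minimal sketch of one way to drop it, assuming you also want that whitespace removed. The `\s*` prefix and the helper name `strip_trailing_code` are additions made for illustration and are not part of the answer above:

```python
import re

# Same pattern as in the answer, with an assumed \s* prefix so that the
# whitespace before the removed token is consumed too.
PATTERN = re.compile(r'\s*(?=\D*\d)[a-zA-Z0-9/]{5,}$')

def strip_trailing_code(text):
    """Remove a trailing 5+ character token that contains a digit (hypothetical helper)."""
    return PATTERN.sub('', text)

# The original pattern leaves a trailing space behind.
original = re.sub(r'(?=\D*\d)[a-zA-Z0-9/]{5,}$', '', 'asus zenfone 3s max zc521tl')
print(repr(original))

# The variant with \s* does not.
print(repr(strip_trailing_code('asus zenfone 3s max zc521tl')))
print(repr(strip_trailing_code('blackberry keyone')))
print(repr(strip_trailing_code('acer liquid z520')))
```

Calling `.rstrip()` on the result of the original substitution would work just as well; folding the whitespace into the pattern simply keeps everything in one step.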
If the token to remove is always the last word in the string and the string always contains more than one word, you might use:
[ \t]+(?=[a-zA-Z0-9/]{5})[a-zA-Z/]*[0-9][a-zA-Z0-9/]*[A-Za-z]$
[ \t]+ : Match 1+ spaces or tabs
(?=[a-zA-Z0-9/]{5}) : Assert at least 5 chars of any of the listed
[a-zA-Z/]* : Match 0+ times any of the listed
[0-9] : Match a digit
[a-zA-Z0-9/]* : Match 0+ times any of the listed in the character class
[A-Za-z] : Match a char a-zA-Z
$ : End of string
In the replacement use an empty string.
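As a quick check, the pattern above can be run against a few of the sample strings from the question. This is a minimal sketch; the names `PATTERN`, `texts`, and `cleaned` are chosen here for illustration:

```python
import re

# Pattern from the answer: leading spaces/tabs, then a final token of at
# least 5 characters that contains a digit and ends with a letter.
PATTERN = re.compile(
    r'[ \t]+(?=[a-zA-Z0-9/]{5})[a-zA-Z/]*[0-9][a-zA-Z0-9/]*[A-Za-z]$'
)

texts = [
    'asus zenfone 3s max zc521tl',
    'asus zenfone max pro (m1) zb601kl/zb602k',
    'nokia 3.1 c',
    'blackberry keyone',
    'acer liquid z520',
]

# Replace the matched token (and the whitespace before it) with nothing.
cleaned = [PATTERN.sub('', text) for text in texts]

for line in cleaned:
    print(line)
```

Because the match starts at the whitespace before the token, no trailing space is left behind. Note that this pattern also requires the removed token to end with a letter, so a final token such as z520 is kept for two reasons: it is shorter than 5 characters and it ends with a digit.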