Regex expression to strip ending of word
I have the following identifiers:
id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD'
id3 = '35358fr1'
id4 = 'as3d99j_br001'
I need a regex to get me the following output:
id1 = '883316040119'
id2 = 'ZWEX01DE9463DB'
id3 = '35358'
id4 = 'as3d99j'
Here is what I have so far:
re.sub(r'_?([a-zA-Z]{2,4}?\d?(00\d)?)$','',vendor_id)
It doesn't work perfectly, though. Here is what it gives me:
BAD - 883316040119_FRIENDS
GOOD - ZWEX01DE9463DB
GOOD - 35358
GOOD - as3d99j
What would be the correct regular expression to get all of them? For the first one, I basically want to strip the ending if it consists only of underscores and letters, so 1928h9829_bundle_hd --> 1928h9829.
Please note that I have hundreds of thousands of identifiers here, and it is required that I use a regular expression. I'm not looking for a Python split() way to do it, as it wouldn't work.
Given the way you present your input, I would suggest this simple regex:
^(?:[^_]+(?=_)|\d+)
This can be tweaked if you want to add details to the spec.
To show a demo on regex101, we have to add \n to the pattern, simply because of the way that site works (it assumes we are working on the whole file, rather than one input at a time): DEMO
Explanation
^ anchor asserts that we are at the beginning of the string
(?: ... ) matches either
[^_]+(?=_) non-underscore characters (followed by an underscore, which is not matched)
| OR
\d+ digits
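As a rough illustration, the pattern above could be applied in Python with re.match. The helper name strip_suffix and the choice to return the identifier unchanged when nothing matches are assumptions made for this sketch, not part of the answer:

```python
import re

# Pattern from the answer: either everything before the first underscore,
# or, failing that, the leading run of digits.
PREFIX = re.compile(r"^(?:[^_]+(?=_)|\d+)")

def strip_suffix(vendor_id):
    """Return the matched prefix, or the input unchanged if nothing matches."""
    m = PREFIX.match(vendor_id)
    return m.group() if m else vendor_id  # fallback is an assumption

for vendor_id in ["883316040119_FRIENDS_HD", "ZWEX01DE9463DB_DMD",
                  "35358fr1", "as3d99j_br001", "1928h9829_bundle_hd"]:
    print(vendor_id, "-->", strip_suffix(vendor_id))
```

Because this captures the part to keep rather than deleting the tail, identifiers that start with a letter and contain no underscore come back unchanged in this sketch.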
This works for the examples:
for id in ids:
    print(id)
883316040119_FRIENDS_HD
ZWEX01DE9463DB_DMD
35358fr1
as3d99j_br001
for id in ids:
    hit = re.sub(r"(_[A-Za-z_]*|_?[A-Za-z]{2,4}?\d?(00\d)?)$", "", id)
    print(hit)
883316040119
ZWEX01DE9463DB
35358
as3d99j
When the tail contains only letters and underscores, the pattern is easygoing and strips off any number of underscores and letters. If the tail does not contain an underscore, or contains digits after the underscore, then it demands the pattern in the question: an optional underscore, 2 to 4 letters, then an optional digit, then an optional zero-zero-digit.
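The two branches described above can be spelled out with re.VERBOSE, which allows comments inside the pattern. This is only a sketch of the same regex; the names TAIL and strip_tail are made up for illustration:

```python
import re

# Same regex as in the answer, laid out one alternative per line.
TAIL = re.compile(r"""
    (
        _[A-Za-z_]*                  # underscore, then only letters and underscores
      |
        _?[A-Za-z]{2,4}?\d?(00\d)?   # stricter tail pattern taken from the question
    )$
""", re.VERBOSE)

def strip_tail(vendor_id):
    """Delete the trailing part matched by TAIL, if any."""
    return TAIL.sub("", vendor_id)

for vendor_id in ["883316040119_FRIENDS_HD", "ZWEX01DE9463DB_DMD",
                  "35358fr1", "as3d99j_br001", "1928h9829_bundle_hd"]:
    print(vendor_id, "-->", strip_tail(vendor_id))
```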
You are checking for the underscore at most once, since ? means {0,1}. This pattern repeats the whole group instead:
r'(_[a-zA-Z]{2,}\d?(00[0-9])?|[a-z]{2,}\d)+$'
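A quick way to try this pattern is to pass it to re.sub. The function name strip_ending is hypothetical. Note that the second alternative, [a-z]{2,}\d, only covers lowercase letters, so an uppercase tail without an underscore is left alone in this sketch:

```python
import re

# Pattern from the answer; the trailing + lets the group repeat,
# so several "_PART" chunks in a row are removed together.
ENDING = re.compile(r"(_[a-zA-Z]{2,}\d?(00[0-9])?|[a-z]{2,}\d)+$")

def strip_ending(vendor_id):
    """Remove the trailing chunk(s) matched by ENDING."""
    return ENDING.sub("", vendor_id)

for vendor_id in ["883316040119_FRIENDS_HD", "ZWEX01DE9463DB_DMD",
                  "35358fr1", "as3d99j_br001", "35358FR1"]:
    print(vendor_id, "-->", strip_ending(vendor_id))
```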
The following reproduces your desired results from your input.
I would use the replace method with this regex:
_[^']+|(?!.*_)('[0-9]+)[^']+
and return capturing group 1.
Perhaps:
result = re.sub("_[^']+|(?!.*_)('[0-9]+)[^']+", r"\1", subject)
The regex first looks for an underscore. If it finds one, it matches everything up to, but not including, the next single quote, and that gets removed.
If that doesn't match, the alternative looks for a string that does NOT contain an underscore. It matches the opening quote and the sequence of digits, returns them in capturing group 1, and then replaces everything after the digits up to, but not including, the single quote.
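Since this regex relies on the surrounding single quotes, it is meant to run over the quoted text itself rather than over bare identifiers. A small sketch of that usage follows. The function name strip_quoted is an assumption, and the code needs Python 3.5 or later, where an unmatched group in the replacement is treated as an empty string:

```python
import re

# Regex from the answer; group 1 holds the opening quote plus the digits.
PATTERN = re.compile(r"_[^']+|(?!.*_)('[0-9]+)[^']+")

def strip_quoted(text):
    """Apply the substitution to text that contains quoted identifiers."""
    # When the first alternative matches, group 1 is unset and \1 expands
    # to an empty string (Python 3.5+), so the tail is simply deleted.
    return PATTERN.sub(r"\1", text)

subject = """id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD'
id3 = '35358fr1'
id4 = 'as3d99j_br001'"""

print(strip_quoted(subject))
```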
This is not a subtraction approach; it just captures the matched string.
The regex is (^[0-9]+)|(^[a-zA-Z0-9]+(?=_)), roughly (^\d+)|(^[\d\w]+(?=_)), although \w also matches an underscore, so the two are not strictly equivalent.
import re
id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD'
id3 = '35358fr1'
id4 = 'as3d99j_br001'
ids = [id1, id2, id3, id4]
for i in ids:
    try:
        print(re.match(r"(^[0-9]+)|(^[a-zA-Z0-9]+(?=_))", i).group())
    except AttributeError:
        # re.match returned None: neither alternative matched
        print("not matched")
output:
883316040119
ZWEX01DE9463DB
35358
as3d99j
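One caveat may be worth checking against the extra example in the question, 1928h9829_bundle_hd: because the digits-only alternative is tried first, the match stops at the first letter. A possible variation, which is an assumption and not part of the answer, is to try the underscore branch first. The names ORIGINAL, SWAPPED and leading_part are made up for this sketch:

```python
import re

# Alternation order as given in the answer: digits first.
ORIGINAL = re.compile(r"(^[0-9]+)|(^[a-zA-Z0-9]+(?=_))")
# Hypothetical variation: try "everything before an underscore" first.
SWAPPED = re.compile(r"(^[a-zA-Z0-9]+(?=_))|(^[0-9]+)")

def leading_part(pattern, vendor_id):
    """Return the matched prefix, or None if the pattern does not match."""
    m = pattern.match(vendor_id)
    return m.group() if m else None

sample = "1928h9829_bundle_hd"
print(leading_part(ORIGINAL, sample))  # 1928
print(leading_part(SWAPPED, sample))   # 1928h9829
```

With the swapped order, the four identifiers from the question still give the same results as before.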