Regex to remove the ending of a word

Question

I have the following identifiers:

id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD' 
id3 = '35358fr1'
id4 = 'as3d99j_br001'

I need a regex that gives me the following output:

id1 = '883316040119'
id2 = 'ZWEX01DE9463DB' 
id3 = '35358'
id4 = 'as3d99j'

Here is what I have so far:

re.sub(r'_?([a-zA-Z]{2,4}?\d?(00\d)?)$','',vendor_id)

It doesn't work perfectly, though. Here is what it gives me:

BAD  - 883316040119_FRIENDS
GOOD - ZWEX01DE9463DB
GOOD - 35358
GOOD - as3d99j

What would be the correct regular expression to handle all of them? For the first one, I basically want to strip the ending if it consists only of underscores and letters, so 1928h9829_bundle_hd --> 1928h9829.

Please note that I have hundreds of thousands of identifiers here, and I am required to use a regular expression. I'm not looking for a Python split() approach, as it wouldn't work.

Answer 1

Given the way you present your input, I would suggest this simple regex:

^(?:[^_]+(?=_)|\d+)

This can be tweaked if you want to add details to the spec.

To show you a regex demo, we have to add \n to the pattern, purely because of the way regex101 works (it assumes we are working on the whole file rather than one input at a time).

Explanation

The ^ anchor asserts that we are at the beginning of the string
The non-capture group (?: ... ) matches either
[^_]+(?=_) non-underscore characters (followed by an underscore, which is not itself matched)
| OR
\d+ digits
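
As a rough sketch of how this pattern could be applied in Python: the helper name leading_id is made up for illustration, and returning the input unchanged when nothing matches is an assumption rather than part of the answer.

```python
import re

# Pattern from this answer: everything before the first underscore,
# or, failing that, the leading run of digits.
PREFIX = re.compile(r"^(?:[^_]+(?=_)|\d+)")

def leading_id(vendor_id):
    """Hypothetical helper: return the matched prefix, or the input unchanged."""
    m = PREFIX.match(vendor_id)
    return m.group(0) if m else vendor_id

for vendor_id in ["883316040119_FRIENDS_HD", "ZWEX01DE9463DB_DMD",
                  "35358fr1", "as3d99j_br001"]:
    print(leading_id(vendor_id))
```

Because this keeps the matched prefix instead of deleting a suffix, it goes through re.match rather than re.sub.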

Answer 2

This works for the examples:

import re
ids = ['883316040119_FRIENDS_HD', 'ZWEX01DE9463DB_DMD', '35358fr1', 'as3d99j_br001']
for id in ids:
    print(id)

883316040119_FRIENDS_HD
ZWEX01DE9463DB_DMD
35358fr1
as3d99j_br001

for id in ids:
    # Raw string, so \d reaches the regex engine unchanged
    hit = re.sub(r"(_[A-Za-z_]*|_?[A-Za-z]{2,4}?\d?(00\d)?)$", "", id)
    print(hit)

883316040119
ZWEX01DE9463DB
35358
as3d99j

When the tail consists only of letters and underscores, the pattern is easygoing and strips off any number of underscores and letters. If the tail does not contain an underscore, or contains digits after the underscore, it demands the pattern from the question: an optional underscore, 2 to 4 letters, then an optional digit, then an optional zero-zero-digit.
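
The two branches described above can be laid out with re.VERBOSE to make them easier to read. This is only a sketch: the pattern is the one from this answer, the helper name strip_tail is illustrative, and the test value comes from the question.

```python
import re

TAIL = re.compile(r"""
    (
        _[A-Za-z_]*                    # branch 1: underscore, then only letters/underscores
      |
        _?[A-Za-z]{2,4}?\d?(00\d)?     # branch 2: the pattern from the question
    )$                                 # either branch must reach the end of the string
""", re.VERBOSE)

def strip_tail(vendor_id):
    """Hypothetical helper: remove the trailing suffix, if any."""
    return TAIL.sub("", vendor_id)

print(strip_tail("1928h9829_bundle_hd"))  # 1928h9829
```

Compiling the pattern once is also a reasonable choice when it has to run over hundreds of thousands of identifiers.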

Answer 3

You are checking for the underscore at most once, since ? means {0,1}. Letting the whole group repeat handles several underscore-separated suffixes:

r'(_[a-zA-Z]{2,}\d?(00[0-9])?|[a-z]{2,}\d)+$'
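
To see the difference, here is a small comparison of the question's pattern with the one above on the first identifier. The variable names are illustrative only.

```python
import re

original = re.compile(r"_?([a-zA-Z]{2,4}?\d?(00\d)?)$")
repeated = re.compile(r"(_[a-zA-Z]{2,}\d?(00[0-9])?|[a-z]{2,}\d)+$")

vendor_id = "883316040119_FRIENDS_HD"

# The original pattern allows at most one underscore, so only "_HD" is removed.
print(original.sub("", vendor_id))   # 883316040119_FRIENDS

# The trailing + lets the group repeat, so "_FRIENDS" and "_HD" both go.
print(repeated.sub("", vendor_id))   # 883316040119
```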

Answer 4

The following reproduces your desired results from your input.

I would use the replace method with this regex:

_[^']+|(?!.*_)('[0-9]+)[^']+

and replace each match with capturing group 1.

Perhaps:

result = re.sub("_[^']+|(?!.*_)('[0-9]+)[^']+", r"\1", subject)

The regex first looks for an underscore. If it finds one, it matches everything up to but not including the next single quote, and that match is removed.

If that doesn't match, the alternative looks for a string that does NOT contain an underscore. It captures the opening quote and the sequence of digits in group 1, then matches everything after the digits up to but not including the closing single quote, so the replacement keeps only group 1.
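
A minimal sketch of this substitution run over the whole quoted text. It assumes the subject is the block of assignments exactly as written in the question, and it relies on Python 3.5 or later, where an unmatched group in the replacement is substituted as an empty string.

```python
import re

# Assumed input: the identifiers as they appear in the question, quotes included.
subject = """id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD'
id3 = '35358fr1'
id4 = 'as3d99j_br001'"""

pattern = re.compile(r"_[^']+|(?!.*_)('[0-9]+)[^']+")

# Group 1 is unset when the first branch matches; Python 3.5+ then
# substitutes an empty string for \1.
result = pattern.sub(r"\1", subject)
print(result)
```

Note that this version works on the quoted source text, not on the bare identifier strings, since the pattern depends on the surrounding single quotes.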

Answer 5

This is not a subtraction approach; it just captures the matching part of the string.

The regex is (^[0-9]+)|(^[a-zA-Z0-9]+(?=_)), which can be abbreviated to (^\d+)|(^[\d\w]+(?=_)). Note that \w also matches the underscore, so the short form is not strictly identical.

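import re

id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD'
id3 = '35358fr1'
id4 = 'as3d99j_br001'
ids = [id1, id2, id3, id4]

# Leading digits, or else leading alphanumerics that are followed by an underscore
pattern = re.compile(r"(^[0-9]+)|(^[a-zA-Z0-9]+(?=_))")

for i in ids:
    m = pattern.match(i)
    # match() returns None when neither alternative applies
    print(m.group() if m else "not matched")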

Output:

883316040119
ZWEX01DE9463DB
35358
as3d99j

5 answers

Answer 1
2 2014-07-29 02:27:54

Answer 2
1 accepted 2014-07-29 02:27:34

Answer 3
0 2014-07-29 01:55:39

Answer 4
0 2014-07-29 02:10:09

Answer 5
0 2014-07-29 02:56:52
