
Regex expression to strip ending of word

I have the following identifiers:

id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD' 
id3 = '35358fr1'
id4 = 'as3d99j_br001'

I need a regex to get me the following output:

id1 = '883316040119'
id2 = 'ZWEX01DE9463DB' 
id3 = '35358'
id4 = 'as3d99j'

Here is what I have so far:

re.sub(r'_?([a-zA-Z]{2,4}?\d?(00\d)?)$','',vendor_id)

It doesn't work perfectly, though. Here is what it gives me:

BAD  - 883316040119_FRIENDS
GOOD - ZWEX01DE9463DB
GOOD - 35358
GOOD - as3d99j

What would be the correct regular expression to get all of them? For the first one, I basically want to strip the ending if it consists only of underscores and letters, so 1928h9829_bundle_hd --> 1928h9829.

Please note that I have hundreds of thousands of identifiers here, and it is required that I use a regular expression. I'm not looking for a Python split() way to do it, as it wouldn't work.

Given the way you present your input, I would suggest this simple regex:

^(?:[^_]+(?=_)|\d+)

This can be tweaked if you want to add details to the spec.

To show you a regex demo, we have to add \n to the pattern, purely because of the way the site regex101 works (it assumes we are working on the whole file, rather than one input at a time): DEMO

Explanation

  • The ^ anchor asserts that we are at the beginning of the string
  • The non-capturing group (?: ... ) matches either
  • [^_]+(?=_) non-underscore characters (followed by an underscore, which is not part of the match)
  • | OR
  • \d+ digits
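
As a rough illustration, the pattern can be applied in Python with re.match, keeping the matched prefix instead of deleting the tail. The helper name strip_tail and the ids list are made up for this sketch, and returning the identifier unchanged when nothing matches is an assumption, not something the answer specifies.

```python
import re

# Pattern from the answer: non-underscore characters up to (not including)
# the first underscore, or else a leading run of digits.
PREFIX = re.compile(r"^(?:[^_]+(?=_)|\d+)")

def strip_tail(identifier):
    """Return the leading part of the identifier kept by PREFIX (hypothetical helper)."""
    m = PREFIX.match(identifier)
    # Assumption: fall back to the original string when neither branch matches.
    return m.group(0) if m else identifier

ids = ["883316040119_FRIENDS_HD", "ZWEX01DE9463DB_DMD", "35358fr1", "as3d99j_br001"]
results = [strip_tail(i) for i in ids]
print(results)
```

Compiling the pattern once is a reasonable choice when the same regex runs over hundreds of thousands of identifiers.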

This works for the examples:

for id in ids:
    print(id)

883316040119_FRIENDS_HD
ZWEX01DE9463DB_DMD
35358fr1
as3d99j_br001

import re

for id in ids:
    # Raw string, so that \d reaches the regex engine unchanged.
    hit = re.sub(r"(_[A-Za-z_]*|_?[A-Za-z]{2,4}?\d?(00\d)?)$", "", id)
    print(hit)

883316040119
ZWEX01DE9463DB
35358
as3d99j

When the tail consists only of letters and underscores, the pattern is easygoing and strips off any number of underscores and letters; if the tail does not contain an underscore, or contains digits after the underscore, then it demands the pattern in the question: an optional underscore, 2 to 4 letters, then an optional digit, then an optional zero-zero-digit.
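
The two branches can be easier to follow when the same pattern is written in verbose mode. This is only a sketch: the names TAIL and strip_tail are invented here, and the comments restate the explanation above rather than add new rules.

```python
import re

TAIL = re.compile(
    r"""
    (
        _[A-Za-z_]*                  # an underscore, then only letters/underscores
      |
        _?[A-Za-z]{2,4}?\d?(00\d)?   # optional underscore, 2-4 letters,
                                     # optional digit, optional 00 + digit
    )$                               # the tail must reach the end of the string
    """,
    re.VERBOSE,
)

def strip_tail(identifier):
    """Remove the trailing part matched by TAIL (hypothetical helper)."""
    return TAIL.sub("", identifier)

for sample in ["883316040119_FRIENDS_HD", "as3d99j_br001", "1928h9829_bundle_hd"]:
    print(sample, "->", strip_tail(sample))
```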

You are checking for the underscore at most once, as ? means {0,1}.

r'(_[a-zA-Z]{2,}\d?(00[0-9])?|[a-z]{2,}\d)+$'
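
A quick way to check this pattern against the sample identifiers, assuming it is meant as a drop-in replacement for the pattern in the question's re.sub call (the answer does not say so explicitly). The samples dictionary and variable names are invented for this sketch.

```python
import re

# Pattern from the answer: the outer group may repeat, so several
# "_LETTERS" chunks in a row are all consumed.
pattern = re.compile(r'(_[a-zA-Z]{2,}\d?(00[0-9])?|[a-z]{2,}\d)+$')

# Inputs from the question mapped to the outputs the question asks for.
samples = {
    '883316040119_FRIENDS_HD': '883316040119',
    'ZWEX01DE9463DB_DMD': 'ZWEX01DE9463DB',
    '35358fr1': '35358',
    'as3d99j_br001': 'as3d99j',
}

stripped = {raw: pattern.sub('', raw) for raw in samples}
for raw, expected in samples.items():
    status = 'OK' if stripped[raw] == expected else 'MISMATCH'
    print(raw, '->', stripped[raw], status)
```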

The following reproduces your desired results from your input.

I would use the replace method with this regex:

_[^']+|(?!.*_)('[0-9]+)[^']+

and return capturing group 1.

Perhaps:

result = re.sub("_[^']+|(?!.*_)('[0-9]+)[^']+", r"\1", subject)

The regex first looks for an underscore. If it finds one, it matches everything up to but not including the next single quote, and that match gets removed.

If that doesn't match, the alternative looks for a string that does NOT have an underscore. It matches the opening quote and the sequence of digits into capturing group 1, then matches everything after the digits up to but not including the closing single quote. The replacement keeps only group 1.
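
Note that this pattern works on the quoted assignments as text, not on the bare identifier values. A minimal sketch, assuming subject holds the four assignment lines from the question as one string, and assuming Python 3.5 or later, where an unmatched group in the replacement is substituted with an empty string:

```python
import re

# Assumption: the input is the source text of the assignments, quotes included.
subject = """id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD'
id3 = '35358fr1'
id4 = 'as3d99j_br001'"""

pattern = re.compile(r"_[^']+|(?!.*_)('[0-9]+)[^']+")

# When the first branch matches, group 1 is unmatched and \1 expands to "",
# so the underscore tail is simply deleted.
result = pattern.sub(r"\1", subject)
print(result)
```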

This is not a subtraction approach: instead of stripping the tail, it just captures the part of the string you want to keep.

The regex is (^[0-9]+)|(^[a-zA-Z0-9]+(?=_)), i.e. roughly (^\d+)|(^[\d\w]+(?=_)). Note that \w also matches an underscore, so the two forms are not strictly identical.

import re
id1 = '883316040119_FRIENDS_HD'
id2 = 'ZWEX01DE9463DB_DMD' 
id3 = '35358fr1'
id4 = 'as3d99j_br001'
ids = [id1, id2, id3, id4]

for i in ids:
    m = re.match(r"(^[0-9]+)|(^[a-zA-Z0-9]+(?=_))", i)
    # re.match returns None when nothing matches
    print(m.group() if m else "not matched")

Output:

883316040119
ZWEX01DE9463DB
35358
as3d99j
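
One case worth checking is the extra example from the question, 1928h9829_bundle_hd, which starts with digits but has letters before the underscore. The sketch below compares the pattern as posted with a reordered variant; the reordering and the names digits_first and prefix_first are suggestions of this sketch, not part of the answer.

```python
import re

# Pattern as posted: the digits-only branch is tried first.
digits_first = re.compile(r"(^[0-9]+)|(^[a-zA-Z0-9]+(?=_))")
# Hypothetical variant: try the "up to the underscore" branch first.
prefix_first = re.compile(r"(^[a-zA-Z0-9]+(?=_))|(^[0-9]+)")

sample = "1928h9829_bundle_hd"
posted = digits_first.match(sample).group()
reordered = prefix_first.match(sample).group()
print(posted)
print(reordered)
```

With the posted order, the digits branch wins as soon as the string starts with a digit, so only the leading digits are kept; with the branches swapped, the whole prefix before the underscore is kept, while 35358fr1 still falls through to the digits branch.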
