How to exclude some characters from the text matched group?

I want to match two cases: 123456-78-9 or 123456789. My goal is to retrieve 123456789 from either case, i.e. to exclude the '-' characters from the first case. The second case is straightforward.

I have tried a regex like r"\b(\d+(?:-)?\d+(?:-)?\d)\b", but it still gives '123456-78-9' back to me.

What is the right regex to use? I know I can do it in two steps: 1) get the three groups of digits with a regex, 2) concatenate them on another line. Still, I would prefer a single regex so that the code is more elegant.

Thanks for any advice!

You can use r'(\d{6})(-?)(\d{2})\2(\d)'
Then join groups 1, 3 and 4, or replace using "\\1\\3\\4". The backreference \2 requires the second separator to be the same as the first, so either both hyphens are present or neither is.
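Both options could be sketched in Python roughly as follows. This is only an illustration: the helper name join_digit_groups and the sample strings are made up here, and re.search is assumed as the way the pattern is applied.

```python
import re

# Group 2 captures the optional hyphen; \2 forces the second separator
# to be the same as the first (both hyphens, or both empty).
PATTERN = re.compile(r'(\d{6})(-?)(\d{2})\2(\d)')

def join_digit_groups(text):
    """Return the nine digits of the first match, or None if nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # Option 1: join groups 1, 3 and 4, skipping the separator in group 2.
    return m.group(1) + m.group(3) + m.group(4)

print(join_digit_groups("123456-78-9"))
print(join_digit_groups("123456789"))

# Option 2: rewrite every match in place using the same groups.
print(PATTERN.sub(r'\1\3\4', "a 123456-78-9 b 123456789"))
```

A string with only one hyphen, such as 123456-789, is rejected because the backreference cannot match.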

This will only match these two inputs:

123456-78-9, or 123456789

It's up to you to put boundary conditions on it if needed.
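As one possible boundary condition, the pattern can be wrapped in \b word boundaries. The sketch below compares the two variants on a made-up input with seven leading digits; the names LOOSE and BOUNDED are only illustrative.

```python
import re

# The answer's pattern, without and with \b word boundaries.
LOOSE = re.compile(r'(\d{6})(-?)(\d{2})\2(\d)')
BOUNDED = re.compile(r'\b(\d{6})(-?)(\d{2})\2(\d)\b')

text = "1234567-78-9"  # seven leading digits, so not one of the two formats

loose = LOOSE.search(text)
bounded = BOUNDED.search(text)

# Without boundaries the pattern latches onto the tail "234567-78-9".
print(loose.group(0) if loose else None)
# With boundaries there is no match at all.
print(bounded)
```

Note that \b still treats a hyphen as a boundary, so a longer hyphenated string such as 99-123456-78-9 would still match. Rejecting that would need stricter conditions, for example lookarounds.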

https://regex101.com/r/ceB10E/1

You may put the digit parts in capturing groups and then replace the entire match with just the captured groups.

Try something like:

\b(\d+)-?(\d+)-?(\d)\b

...and replace with:

\1\2\3

Note that the two non-capturing groups you're using are redundant: (?:-)? is equivalent to -?.

Regex demo.

Python example:

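import re

# Capture the three digit runs; the optional hyphens stay outside the groups.
regex = r"\b(\d+)-?(\d+)-?(\d)\b"

test_str = ("123456-78-9\n"
            "123456789")
# Rebuild each match from the captured groups only.
subst = r"\1\2\3"

# Pass count and flags by keyword; passing them positionally is deprecated
# in recent Python versions.
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

if result:
    print(result)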

Output:

123456789
123456789

Try it online.
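If the goal is to extract the numbers rather than rewrite the whole string, the same capturing groups can be joined per match. A minimal sketch, assuming the numbers are embedded in a longer, made-up sample text:

```python
import re

pattern = re.compile(r"\b(\d+)-?(\d+)-?(\d)\b")
text = "ids: 123456-78-9, 123456789"

# m.groups() holds only the digit runs, so joining them drops the hyphens.
numbers = ["".join(m.groups()) for m in pattern.finditer(text)]
print(numbers)
```

Because this pattern uses \d+, it also accepts other lengths, such as 12-3-4. Fixed counts like \d{6} and \d{2} would be needed if only the two formats from the question should match.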

The easiest thing to do here would be to first use re.sub to remove all non-digit characters from the input. Then use an equality comparison to check the result:

import re

inp = "123456-78-9"
# Strip every non-digit character, then compare against the expected number.
if re.sub(r'\D', '', inp) == '123456789':
    print("MATCH")

Edit: If I misunderstood your problem, and the inputs could instead be anything and you just want to match the two formats given, then use an alternation:

\b(?:\d{6}-\d{2}-\d|\d{9})\b

Script:

import re

inp = "123456-78-9"
# Match either the hyphenated format or nine consecutive digits.
if re.search(r'\b(?:\d{6}-\d{2}-\d|\d{9})\b', inp):
    print("MATCH")
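To actually retrieve the digits, the alternation can be combined with a plain string replacement on each match. This is a sketch rather than a definitive solution; the function name extract_numbers and the sample text are invented for illustration.

```python
import re

# Either the hyphenated format or nine consecutive digits.
FORMATS = re.compile(r'\b(?:\d{6}-\d{2}-\d|\d{9})\b')

def extract_numbers(text):
    """Find numbers in either format and return them without hyphens."""
    return [m.group(0).replace('-', '') for m in FORMATS.finditer(text)]

print(extract_numbers("a 123456-78-9 b 123456789 c 12-34"))
```

This takes two steps per match instead of one regex, but the pattern stays strict about which formats it accepts.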
