How to exclude certain characters from a text match group?

Question

I want to match two cases: 123456-78-9 or 123456789. My goal is to retrieve 123456789 from either case, i.e. to exclude the '-' from the first case; the second case is straightforward.

I have tried a regex like r"\b(\d+(?:-)?\d+(?:-)?\d)\b", but it still gives me '123456-78-9' back.

What is the right regex to use? I know I could do it in two steps: 1) capture the three digit parts with a regex, 2) concatenate them on another line. I would still prefer a single regex so that the code is more elegant.

Thanks for any advice!

Answer 1

You can use r'(\d{6})(-?)(\d{2})\2(\d)'
Then join groups 1, 3 and 4, or replace using "\\1\\3\\4".

This will only match these two input formats:

123456-78-9, or 123456789

It's up to you to put boundary conditions on it if needed.
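As a rough illustration of the replace approach, here is a minimal Python sketch. The helper name strip_hyphens is made up for this example; the pattern is the one given above, with no boundary conditions added.

```python
import re

# Pattern from the answer: group 2 captures the optional hyphen, and the
# backreference \2 requires the second separator to be identical to the first.
PATTERN = re.compile(r'(\d{6})(-?)(\d{2})\2(\d)')

def strip_hyphens(text):
    """Hypothetical helper: keep groups 1, 3 and 4, dropping the separators."""
    return PATTERN.sub(r'\1\3\4', text)

print(strip_hyphens("123456-78-9"))  # 123456789
print(strip_hyphens("123456789"))    # 123456789
print(strip_hyphens("123456-789"))   # 123456-789 (mixed separators do not match)
```

The backreference is what rejects the mixed form: once group 2 has matched a hyphen, the second separator must be a hyphen too.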

https://regex101.com/r/ceB10E/1

Answer 2

You can put the number parts in capturing groups and then replace the entire match with just the captured groups.

Try something like:

\b(\d+)-?(\d+)-?(\d)\b

...and replace with:

\1\2\3

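Note that the two non-capturing groups you're using are redundant: (?:-)? is equivalent to -?. A non-capturing group only avoids creating a numbered group; whatever it matches is still part of the overall match, which is why the hyphens were returned.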
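To see the redundancy concretely, a small Python check (the variable names are only for illustration) can run the asker's pattern next to the simplified one. Both capture the same text, hyphens included, because everything inside the outer capturing group is returned.

```python
import re

text = "123456-78-9"

# The asker's pattern, with non-capturing groups around each hyphen.
original = re.search(r"\b(\d+(?:-)?\d+(?:-)?\d)\b", text)
# The same pattern with the redundant groups removed.
simplified = re.search(r"\b(\d+-?\d+-?\d)\b", text)

print(original.group(1))    # 123456-78-9
print(simplified.group(1))  # 123456-78-9
```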

Regex demo.

Python example:

import re

regex = r"\b(\d+)-?(\d+)-?(\d)\b"

test_str = ("123456-78-9\n"
            "123456789")
subst = "\\1\\2\\3"

# Pass flags by keyword; positional count/flags are deprecated in recent Python versions.
result = re.sub(regex, subst, test_str, flags=re.MULTILINE)

if result:
    print(result)

Output:

123456789
123456789

Try it online.

Answer 3

The easiest thing to do here would be to first use re.sub to remove all non-digit characters from the input. Then, use an equality comparison to check the input:

import re

inp = "123456-78-9"
# \D matches any non-digit character, so this strips the hyphens.
if re.sub(r'\D', '', inp) == '123456789':
    print("MATCH")

Edit: If I misunderstood your problem, and the inputs could be anything and you just want to match the two formats given, then use an alternation:

\b(?:\d{6}-\d{2}-\d|\d{9})\b

Script:

import re

inp = "123456-78-9"
# Matches either the hyphenated 6-2-1 form or nine consecutive digits.
if re.search(r'\b(?:\d{6}-\d{2}-\d|\d{9})\b', inp):
    print("MATCH")
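If the goal is to retrieve the digits rather than only test for a match, one possible way to combine the two ideas in this answer is to find matches with the alternation and then drop the hyphens from each match. This is only a sketch; extract_numbers and the sample string are made up for illustration.

```python
import re

# Alternation from the answer: hyphenated 6-2-1 form, or nine plain digits.
PATTERN = re.compile(r'\b(?:\d{6}-\d{2}-\d|\d{9})\b')

def extract_numbers(text):
    """Hypothetical helper: return each match with its hyphens removed."""
    return [m.group(0).replace('-', '') for m in PATTERN.finditer(text)]

print(extract_numbers("ids: 123456-78-9, 987654321, 12-34"))
# ['123456789', '987654321']
```

This still takes two steps (match, then clean up), but both live in one small function.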
