I want to match two cases: 123456-78-9 or 123456789. My goal is to retrieve 123456789 in either case, i.e. to exclude the '-' characters in the first case; the second case is straightforward.
I have tried a regex like r"\b(\d+(?:-)?\d+(?:-)?\d)\b", but it still gives me '123456-78-9' back.
What is the right regex to use? I know I could do it in two steps: 1) get the three groups of digits with a regex, 2) concatenate them on another line, but I would prefer a single regex so that the code is more elegant.
Thanks for any advice!
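You can use r'(\d{6})(-?)(\d{2})\2(\d)'
Then join groups 1, 3 and 4, or replace using "\\1\\3\\4".
Group 2 captures the optional hyphen, and the backreference \2 requires the second separator to be the same as the first, so a mixed form such as 123456-789 is rejected.
This matches only the two formats 123456-78-9 and 123456789.
It's up to you to put boundary conditions on it if needed.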
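A minimal Python sketch of the "join groups 1, 3 and 4" idea. The helper name `normalize` is hypothetical, and using `fullmatch` as the boundary condition is an assumption (it treats the whole input string as one identifier), not something the answer prescribes:

```python
import re

# Group 2 captures an optional hyphen; \2 then demands the same
# separator (a hyphen or nothing) before the final digit.
PATTERN = re.compile(r'(\d{6})(-?)(\d{2})\2(\d)')

def normalize(text):
    """Return the nine digits without hyphens, or None if text does not match."""
    m = PATTERN.fullmatch(text)  # assumption: the whole string must be the identifier
    if m is None:
        return None
    # Join the digit groups and skip group 2 (the separator).
    return m.group(1) + m.group(3) + m.group(4)

print(normalize("123456-78-9"))  # 123456789
print(normalize("123456789"))    # 123456789
print(normalize("123456-789"))   # None (separators differ)

# Replacement variant for identifiers embedded in longer text.
print(PATTERN.sub(r'\1\3\4', "id 123456-78-9 here"))  # id 123456789 here
```

When the identifier sits inside longer text, adding `\b` on both sides of the pattern would be one way to supply the boundary conditions mentioned above.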
You can put the digit parts in capturing groups and then replace the entire match with just the captured groups.
Try something like:
\b(\d+)-?(\d+)-?(\d)\b
and replace with:
\1\2\3
Note that the two non-capturing groups you're using are redundant: (?:-)? is equivalent to -?.
Python example:
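import re

# Capture the three digit runs; the hyphens stay outside the groups.
regex = r"\b(\d+)-?(\d+)-?(\d)\b"
test_str = ("123456-78-9\n"
            "123456789")
# Rebuild each match from the captured digits only.
subst = "\\1\\2\\3"
# count and flags are passed by keyword so they cannot be mixed up.
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)
if result:
    print(result)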
Output:
123456789
123456789
The easiest thing to do here would be to first use re.sub to remove all non-digit characters from the input. Then use an equality comparison to check the input:
import re

inp = "123456-78-9"
# \D matches any non-digit character, so the hyphens are stripped.
if re.sub(r'\D', '', inp) == '123456789':
    print("MATCH")
Edit: If I misunderstood your problem, and instead the inputs could be anything, and you just want to match the two formats given, then use an alternation:
\b(?:\d{6}-\d{2}-\d|\d{9})\b
Script:
import re

inp = "123456-78-9"
# Accept either the hyphenated 6-2-1 layout or nine consecutive digits.
if re.search(r'\b(?:\d{6}-\d{2}-\d|\d{9})\b', inp):
    print("MATCH")
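The alternation only tests for a match; to also retrieve the digits without hyphens, as the question asks, the matched text still needs its hyphens stripped. A small sketch of one way to do that, where the helper name `extract_ids` and the sample text are hypothetical:

```python
import re

# Same alternation as above: hyphenated 6-2-1 layout or nine plain digits.
PATTERN = re.compile(r'\b(?:\d{6}-\d{2}-\d|\d{9})\b')

def extract_ids(text):
    """Return every matched identifier in text with its hyphens removed."""
    # The pattern has no capturing groups, so findall returns the full matches.
    return [match.replace('-', '') for match in PATTERN.findall(text)]

sample = "a 123456-78-9 b 123456789 c 12345-678-9"
print(extract_ids(sample))  # ['123456789', '123456789']
```

The last token in the sample is skipped because its digits are grouped 5-3-1, which fits neither branch of the alternation.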