Regular expression to extract number with hyphen

The text looks like "1-2years. 3years. 10years."

I want to get the result [(1,2),(3),(10)].

I am using Python.

I first tried r"([0-9]?)[-]?([0-9])years". It works well except for the case of 10. I also tried r"([0-9]?)[-]?([0-9]|10)years", but the result is still [(1,2),(3),(1,0)].

This should work:

import re

st = '1-2years. 3years. 10years.'
result = [tuple(e for e in tup if e) 
          for tup in re.findall(r'(?:(\d+)-(\d+)|(\d+))years', st)]
# [('1', '2'), ('3',), ('10',)]

The regex looks for either one number, or two numbers separated by a hyphen, immediately before the word years. If we pass this to re.findall(), it returns [('1', '2', ''), ('', '', '3'), ('', '', '10')], so we also use a quick list comprehension to filter out the empty strings.

Alternatively, we could use r'(\d+)(?:-(\d+))?years', which has essentially the same effect and is closer to what you've already tried.

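Your attempt r"([0-9]?)[-]?([0-9])years" doesn't work for the case of 10 because each group is only allowed to match a single digit (or, for the first group, none).

You also don't need to put the hyphen in brackets; a bare - matches a literal hyphen outside a character class.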
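To see the failure concretely, here is a small sketch that runs both patterns from the question against the sample text. The variable names first and second are only illustrative. Note that adding |10 to the second group does not help: the first group has already consumed the 1, and the match succeeds with 0 in the second group before the 10 alternative is ever needed.

```python
import re

text = "1-2years. 3years. 10years."

# Original attempt: every group can hold at most one digit.
first = re.findall(r"([0-9]?)[-]?([0-9])years", text)
print(first)   # [('1', '2'), ('', '3'), ('1', '0')]

# Second attempt: the optional first group still grabs the "1" of "10",
# so the second group matches "0" via [0-9] and "10" is never tried.
second = re.findall(r"([0-9]?)[-]?([0-9]|10)years", text)
print(second)  # [('1', '2'), ('', '3'), ('1', '0')]
```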

This should work (see the demo on Regex101):

(\d+)(?:-(\d+))?years

Explanation:

  • (\d+) : Capturing group for one or more digits
  • (?: ) : Non-capturing group
  • - : hyphen
  • (\d+) : Capturing group for one or more digits
  • (?: )? : Make the previous non-capturing group optional
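As a sketch, the same pattern can be written with the re.VERBOSE flag so that each part of the breakdown above sits next to the piece of the regex it describes. The layout and comments are my own; the pattern itself is unchanged.

```python
import re

pattern = re.compile(
    r"""
    (\d+)        # group 1: one or more digits
    (?:          # start of the non-capturing group
        -        #   a literal hyphen
        (\d+)    #   group 2: one or more digits
    )?           # the whole hyphen-plus-digits part is optional
    years        # the literal word years
    """,
    re.VERBOSE,
)

matches = pattern.findall("1-2years. 3years. 10years.")
print(matches)  # [('1', '2'), ('3', ''), ('10', '')]
```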

In Python:

import re

result = re.findall(r"(\d+)(?:-(\d+))?years", "1-2years. 3years. 10years.")

# Gives: [('1', '2'), ('3', ''), ('10', '')]

Each tuple in the list contains two elements: the number on the left side of the hyphen and the number on the right side of the hyphen. When there is no hyphen, the second element is an empty string. Removing the blank elements is easy: loop over each item in result, then loop over each match in that item and keep it (converted to int) only if it is not empty.

final_result = [tuple(int(match) for match in item if match) for item in result]

# gives: [(1, 2), (3,), (10,)]
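A possible variation, shown here only as a sketch: re.finditer() yields match objects, and match.groups() reports a group that did not participate as None rather than as an empty string. Filtering on None makes the intent a little more explicit. The name ranges is just illustrative.

```python
import re

text = "1-2years. 3years. 10years."
pattern = re.compile(r"(\d+)(?:-(\d+))?years")

# groups() returns None for the optional group when there is no hyphen,
# so those entries are dropped before converting to int.
ranges = [
    tuple(int(g) for g in m.groups() if g is not None)
    for m in pattern.finditer(text)
]
print(ranges)  # [(1, 2), (3,), (10,)]
```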

You can use this pattern: (?:(\d+)-)?(\d+)years

See Regex Demo

Code:

import re

pattern = r"(?:(\d+)-)?(\d+)years"
text = "1-2years. 3years. 10years."
print([tuple(int(z) for z in x if z) for x in re.findall(pattern, text)])

Output:

[(1, 2), (3,), (10,)]

You only match a single digit because the character class [0-9] is not repeated.

Another option is to capture the leading digits together with an optional part for - and more digits as a single group.

Then you can split each match on -.

\b(\d+(?:-\d+)?)years\.
  • \b A word boundary
  • ( Capture group 1 (which will be returned by re.findall)
    • \d+(?:-\d+)? Match 1+ digits and optionally match - and again 1+ digits
  • ) Close group 1
  • years\. Match literally with the escaped .

See a regex demo and a Python demo.
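To illustrate what the \b and the trailing \. add, here is a small sketch comparing the pattern with and without the word boundary. The sample string is made up for the comparison and is not from the question.

```python
import re

with_boundary = r"\b(\d+(?:-\d+)?)years\."
without_boundary = r"(\d+(?:-\d+)?)years\."

# Hypothetical input: "v10years." has digits glued to a letter,
# and "3years old" is not followed by a period.
sample = "v10years. 3years old. 4-5years."

strict = re.findall(with_boundary, sample)
loose = re.findall(without_boundary, sample)
print(strict)  # ['4-5']
print(loose)   # ['10', '4-5']
```

Both versions skip "3years old" because the pattern requires a literal period right after years; drop the \. if the text does not always end that way.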

Example

import re

pattern = r"\b(\d+(?:-\d+)?)years\."
s = "1-2years. 3years. 10years."

res = [tuple(v.split('-')) for v in re.findall(pattern, s)]
print(res)

Output

[('1', '2'), ('3',), ('10',)]

Or, if a list of lists is acceptable instead of tuples:

res = [v.split('-') for v in re.findall(pattern, s)]

Output

[['1', '2'], ['3'], ['10']]
