
Using regex to extract all digit and word numbers

I am trying to extract all numbers, whether written as words or as digits, from a text.

text = 'one two three 10 number'
numbers = "(^a(?=\s)|one|two|three|four|five|six|seven|eight|nine|ten| \
          eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen| \
          eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty| \
          ninety|hundred|thousand)"

print re.search(numbers, text).group(0)

This gives me only the first number word.

My expected result is ['one', 'two', 'three', '10'].

How can I modify it so that I get all the word numbers as well as the digit numbers in a list?

There are several issues here:

  • The pattern should be used with the VERBOSE flag (add (?x) at the start), because the backslash line continuations leave the indentation whitespace inside the pattern, so alternatives such as eleven only match when preceded by those spaces
  • The nine alternative will match the nine in ninety, so you should either put the longer values first or use word boundaries, \b
  • Declare the pattern with a raw string literal to avoid issues like \b being parsed as a backspace rather than a word boundary
  • To match digits, you may add a |\d+ branch to your number matching group
  • To match multiple non-overlapping occurrences of the substrings inside the input string, you need to use re.findall (or re.finditer), not re.search.
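The whitespace, nine/ninety, and re.search points can be reproduced with a shortened version of the pattern. This is only a sketch: the names sample, broken and fixed and the sample string are made up for illustration and are not part of the question.

```python
import re

sample = "eleven ninety"  # illustrative input, not from the question

# Shortened copy of the question's pattern. The backslash continuation keeps
# the next line's indentation inside the string, so the third alternative is
# "<spaces>eleven" rather than "eleven".
broken = "(nine|ten| \
          eleven|ninety)"

# "eleven" is missed entirely and "ninety" is cut down to "nine".
print(re.findall(broken, sample))         # ['nine']

# Raw string, word boundaries, no stray whitespace.
fixed = r"\b(?:nine|ten|eleven|ninety)\b"

# re.search stops at the first match; re.findall collects all of them.
print(re.search(fixed, sample).group(0))  # eleven
print(re.findall(fixed, sample))          # ['eleven', 'ninety']
```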

Here is my suggestion:

import re
text = 'one two three 10 number eleven eighteen ninety  \n '
numbers = r"""(?x)          # Turn on free spacing mode
            (
              ^a(?=\s)|     # Match a at the start of the string, before whitespace
              \d+|          # Match one or more digits
              \b            # Initial word boundary
              (?:
                  one|two|three|four|five|six|seven|eight|nine|ten|
                  eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|
                  eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty|
                  ninety|hundred|thousand
              )             # A list of alternatives
              \b            # Trailing word boundary
            )"""

print(re.findall(numbers, text))

See Python demo

And here is a regex demo.
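If the positions of the matches are needed as well, re.finditer can be used in place of re.findall. A minimal sketch, assuming a shortened word list to keep it brief (the full list behaves the same way); the found variable is just an illustrative name.

```python
import re

# Shortened word list for illustration only.
numbers = r"""(?x)
    \d+ |                                       # one or more digits
    \b(?:one|two|three|ten|eleven|ninety)\b     # whole number words
"""

text = "one two three 10 number"

# Each match object carries the matched text and its start/end offsets.
found = [(m.group(0), m.start(), m.end()) for m in re.finditer(numbers, text)]
print(found)
# [('one', 0, 3), ('two', 4, 7), ('three', 8, 13), ('10', 14, 16)]
```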

The re.findall call and the added [0-9]+ branch work well for your list. Unfortunately, if you try to match something like seventythree you will get seven and three, so you will need something better than the pattern below :-)

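import re

text = 'one two three 10 number'
# Adjacent raw string literals are joined, so no indentation leaks into the pattern
numbers = (r"(^a(?=\s)|one|two|three|four|five|six|seven|eight|nine|ten|"
           r"eleven|twelve|thirteen|fourteen|fifteen|sixteen|seventeen|"
           r"eighteen|nineteen|twenty|thirty|forty|fifty|sixty|seventy|eighty|"
           r"ninety|hundred|thousand|[0-9]+)")

x = re.findall(numbers, text)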
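One possible improvement for run-together words such as seventythree is to try the longer alternatives first, so that seventy wins over seven. The following is a sketch under that assumption, not a complete number parser; the words list and the alternatives variable are illustrative names.

```python
import re

words = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
         "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen",
         "eighteen", "nineteen", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
         "eighty", "ninety", "hundred", "thousand"]

# Longest words first, so "seventy" is tried before "seven".
alternatives = "|".join(sorted(words, key=len, reverse=True))
numbers = re.compile(r"[0-9]+|" + alternatives)

print(numbers.findall("seventythree"))             # ['seventy', 'three']
print(numbers.findall("one two three 10 number"))  # ['one', 'two', 'three', '10']
```

Word boundaries are left out on purpose here so that compounds can be split, which also means the pattern will match inside unrelated words (for example, the one in someone).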
