Regex in Python 3: match everything after a number or optional period but before an optional comma

I'm trying to extract ingredient names from recipes, without any measurements or directions. Each recipe's ingredients are a list of strings like the following:

['1  medium tomato, cut into 8 wedges',
 '4  c. torn mixed salad greens',
 '1/2  small red onion, sliced and separated into rings',
 '1/4  small cucumber, sliced',
 '1/4  c. sliced pitted ripe olives',
 '2  Tbsp. reduced-calorie Italian salad dressing',
 '2  Tbsp. lemon juice',
 '1  Tbsp. water',
 '1/2  tsp. dried mint, crushed',
 '1/4  c. crumbled Feta cheese or 2 Tbsp. crumbled Blue cheese']

I want to return the following list:

['medium tomato',
 'torn mixed salad greens',
 'small red onion',
 'small cucumber',
 'sliced pitted ripe olives',
 'reduced-calorie Italian salad dressing',
 'lemon juice',
 'water',
 'dried mint',
 'crumbled Blue cheese']

The closest pattern I've found is with:

pattern = r'[\s\d\.]* ([^\,]+).*'

but in testing with:

for ing in ingredients:
    print(re.findall(pattern, ing))

the measurement abbreviations and their periods are returned as well, e.g.:

['c. torn mixed salad greens']

while

pattern = r'(?<=\. )[^.]*$'

fails to capture lines with no period, and keeps the comma and everything after it when both appear, i.e.:

[]
['torn mixed salad greens']
[]
[]
['sliced pitted ripe olives']
['reduced-calorie Italian salad dressing']
['lemon juice']
['water']
['dried mint, crushed']
['crumbled Blue cheese']

Thank you in advance!

The problem is that you are pairing the number with the dot.

\s\d*\.?

should match the number correctly, with or without a dot.

You can use this pattern:

for ing in ingredients:
    print(re.search(r'[a-z][^.,]*(?![^,])', ing, re.I).group())

pattern details:

[a-z][^.,]* # a substring that starts with a letter and that doesn't contain a period
            # or a comma
(?![^,])    # not followed by a character that is not a comma
            # (in other words, followed by a comma or the end of the string)
re.I        # make the pattern case insensitive; passed as a flag because Python 3.11+
            # rejects an inline (?i) that is not at the start of the pattern
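
To check the pattern against the whole list from the question, here is a minimal, self-contained sketch. The names `NAME` and `ingredient_name` are made up for this illustration and are not part of the original answer.

```python
import re

ingredients = [
    '1  medium tomato, cut into 8 wedges',
    '4  c. torn mixed salad greens',
    '1/2  small red onion, sliced and separated into rings',
    '1/4  small cucumber, sliced',
    '1/4  c. sliced pitted ripe olives',
    '2  Tbsp. reduced-calorie Italian salad dressing',
    '2  Tbsp. lemon juice',
    '1  Tbsp. water',
    '1/2  tsp. dried mint, crushed',
    '1/4  c. crumbled Feta cheese or 2 Tbsp. crumbled Blue cheese',
]

# Hypothetical name: the pattern from the answer, compiled once,
# with case-insensitivity passed as a flag.
NAME = re.compile(r'[a-z][^.,]*(?![^,])', re.IGNORECASE)


def ingredient_name(line):
    """Return the first run that starts with a letter, contains no period
    or comma, and ends at a comma or at the end of the string."""
    match = NAME.search(line)
    return match.group() if match else None


names = [ingredient_name(line) for line in ingredients]
for name in names:
    print(name)
```

One limitation: this relies on every unit abbreviation ending in a period. A line such as '2 cups flour' would come back as 'cups flour', because nothing marks "cups" as a unit.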

Description

I'd recommend the following regex to find and remove the substrings you are not interested in. Because the units of measurement are spelled out, it also handles units that are not abbreviated. Use it with the case-insensitive flag so that "Tbsp." matches tbsp\.

\s*(?:(?:(?:[0-9]\s*)?[0-9]+\/)?[0-9]+\s*(?:(?:c\.|cups?|tsp\.|teaspoon|tbsp\.|tablespoon)\s*)?)|,.*|.*\bor\b

Replace with: nothing

Examples

Live demo

Showing how this will match

https://regex101.com/r/qV5iR8/3

Sample string

Note that the last line has two ingredients separated by "or"; according to the OP, the first ingredient should be eliminated.

1  medium tomato, cut into 8 wedges
4  c. torn mixed salad greens
1/2  small red onion, sliced and separated into rings
1/4  small cucumber, sliced
1 1/4  c. sliced pitted ripe olives
2  Tbsp. reduced-calorie Italian salad dressing
2  Tbsp. lemon juice
1  Tbsp. water
1/2  tsp. dried mint, crushed
1/4  c. crumbled Feta cheese or 2 Tbsp. crumbled Blue cheese

After Replacement

medium tomato
torn mixed salad greens
small red onion
small cucumber
sliced pitted ripe olives
reduced-calorie Italian salad dressing
lemon juice
water
dried mint
crumbled Blue cheese
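
In Python, the replacement can be sketched with `re.sub`. This is a minimal example rather than a definitive implementation: the names `PATTERN` and `strip_measurements` are made up for illustration, and the `re.IGNORECASE` flag is an assumption added so that "Tbsp." matches the lowercase `tbsp\.` alternative.

```python
import re

# The pattern from this answer, written with single backslashes
# inside Python raw strings.
PATTERN = re.compile(
    r"\s*(?:(?:(?:[0-9]\s*)?[0-9]+\/)?[0-9]+\s*"
    r"(?:(?:c\.|cups?|tsp\.|teaspoon|tbsp\.|tablespoon)\s*)?)"
    r"|,.*|.*\bor\b",
    re.IGNORECASE,  # assumption: lets "Tbsp." match "tbsp\."
)

samples = [
    "1  medium tomato, cut into 8 wedges",
    "4  c. torn mixed salad greens",
    "1/2  small red onion, sliced and separated into rings",
    "1/4  small cucumber, sliced",
    "1 1/4  c. sliced pitted ripe olives",
    "2  Tbsp. reduced-calorie Italian salad dressing",
    "2  Tbsp. lemon juice",
    "1  Tbsp. water",
    "1/2  tsp. dried mint, crushed",
    "1/4  c. crumbled Feta cheese or 2 Tbsp. crumbled Blue cheese",
]


def strip_measurements(line):
    """Delete every match of PATTERN, then trim any leftover whitespace."""
    return PATTERN.sub("", line).strip()


cleaned = [strip_measurements(line) for line in samples]
for name in cleaned:
    print(name)
```

Because `re.sub` replaces every non-overlapping match, the last line is handled in three passes over the same string: the leading "1/4  c. ", then everything up to and including "or", then " 2 Tbsp. ".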

Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      (?:                      group, but do not capture (optional
                               (matching the most amount possible)):
----------------------------------------------------------------------
        [0-9]                    any character of: '0' to '9'
----------------------------------------------------------------------
        \s*                      whitespace (\n, \r, \t, \f, and " ")
                                 (0 or more times (matching the most
                                 amount possible))
----------------------------------------------------------------------
      )?                       end of grouping
----------------------------------------------------------------------
      [0-9]+                   any character of: '0' to '9' (1 or
                               more times (matching the most amount
                               possible))
----------------------------------------------------------------------
      \/                       '/'
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
    [0-9]+                   any character of: '0' to '9' (1 or more
                             times (matching the most amount
                             possible))
----------------------------------------------------------------------
    \s*                      whitespace (\n, \r, \t, \f, and " ") (0
                             or more times (matching the most amount
                             possible))
----------------------------------------------------------------------
    (?:                      group, but do not capture (optional
                             (matching the most amount possible)):
----------------------------------------------------------------------
      (?:                      group, but do not capture:
----------------------------------------------------------------------
        c                        'c'
----------------------------------------------------------------------
        \.                       '.'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        cup                      'cup'
----------------------------------------------------------------------
        s?                       's' (optional (matching the most
                                 amount possible))
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        tsp                      'tsp'
----------------------------------------------------------------------
        \.                       '.'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        teaspoon                 'teaspoon'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        tbsp                     'tbsp'
----------------------------------------------------------------------
        \.                       '.'
----------------------------------------------------------------------
       |                        OR
----------------------------------------------------------------------
        tablespoon               'tablespoon'
----------------------------------------------------------------------
      )                        end of grouping
----------------------------------------------------------------------
      \s*                      whitespace (\n, \r, \t, \f, and " ")
                               (0 or more times (matching the most
                               amount possible))
----------------------------------------------------------------------
    )?                       end of grouping
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  ,                        ','
----------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
----------------------------------------------------------------------
 |                        OR
----------------------------------------------------------------------
  .*                       any character except \n (0 or more times
                           (matching the most amount possible))
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
  or                       'or'
----------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
----------------------------------------------------------------------
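
The node-by-node breakdown above can also be kept next to the pattern itself by using Python's `re.VERBOSE` mode, which ignores whitespace in the pattern and allows `#` comments. The sketch below is one possible layout. The name `UNWANTED` is hypothetical, and `re.IGNORECASE` is again an assumption so that "Tbsp." is recognized.

```python
import re

UNWANTED = re.compile(
    r"""
    \s*                     # leading whitespace
    (?:
        (?:
            (?:[0-9]\s*)?   # optional single-digit whole part, as in '1 1/4'
            [0-9]+ /        # numerator and slash of a fraction
        )?
        [0-9]+              # whole number, or denominator of the fraction
        \s*
        (?:
            (?:c\.|cups?|tsp\.|teaspoon|tbsp\.|tablespoon)  # unit of measurement
            \s*
        )?
    )
    | ,.*                   # a comma and everything after it
    | .*\bor\b              # everything up to and including the word 'or'
    """,
    re.VERBOSE | re.IGNORECASE,
)

print(UNWANTED.sub("", "1 1/4  c. sliced pitted ripe olives"))
# sliced pitted ripe olives
print(UNWANTED.sub("", "1/4  c. crumbled Feta cheese or 2 Tbsp. crumbled Blue cheese"))
# crumbled Blue cheese
```

The pattern contains no literal spaces or `#` characters, so switching to verbose mode does not change what it matches.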
