I'm trying to return ingredients from recipes without any measurements or directions. Each recipe's ingredients are a list of strings like the following:
['1 medium tomato, cut into 8 wedges',
'4 c. torn mixed salad greens',
'1/2 small red onion, sliced and separated into rings',
'1/4 small cucumber, sliced',
'1/4 c. sliced pitted ripe olives',
'2 Tbsp. reduced-calorie Italian salad dressing',
'2 Tbsp. lemon juice',
'1 Tbsp. water',
'1/2 tsp. dried mint, crushed',
'1/4 c. crumbled Feta cheese or 2 Tbsp. crumbled Blue cheese']
I want to return the following list:
['medium tomato',
'torn mixed salad greens',
'small red onion',
'small cucumber',
'sliced pitted ripe olives',
'reduced-calorie Italian salad dressing',
'lemon juice',
'water',
'dried mint',
'crumbled Blue cheese']
The closest pattern I've found is:
pattern = r'[\s\d\.]* ([^\,]+).*'
but in testing with:
for ing in ingredients:
    print(re.findall(pattern, ing))
the measurement abbreviations and their periods are returned as well, e.g.:
['c. torn mixed salad greens']
while
pattern = r'(?<=\. )[^.]*$'
fails to match lines with no period, and keeps the text after the comma when both a period and a comma appear, i.e.:
[]
['torn mixed salad greens']
[]
[]
['sliced pitted ripe olives']
['reduced-calorie Italian salad dressing']
['lemon juice']
['water']
['dried mint, crushed']
['crumbled Blue cheese']
Thank you in advance!
The problem is that your pattern pairs the number with the dot.
\s\d*\.?
should match the number correctly (with or without a dot).
You can use this pattern:
for ing in ingredients:
    print(re.search(r'[a-z][^.,]*(?![^,])', ing, re.I).group())
pattern details:
[a-z][^.,]*  # a substring that starts with a letter and that doesn't contain a period
             # or a comma
(?![^,])     # not followed by a character that is not a comma
             # (in other words, followed by a comma or the end of the string)
re.I         # make the pattern case insensitive (passed as a flag, because Python 3.11+
             # rejects an inline (?i) that is not at the start of the pattern)
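As a rough check, the pattern can be run over the sample list from the question. This is only a sketch: the names `NAME` and `ingredient_name` are made up for illustration, and the `None` guard is an addition for lines that contain no match at all.

```python
import re

ingredients = [
    '1 medium tomato, cut into 8 wedges',
    '4 c. torn mixed salad greens',
    '1/2 small red onion, sliced and separated into rings',
    '1/4 small cucumber, sliced',
    '1/4 c. sliced pitted ripe olives',
    '2 Tbsp. reduced-calorie Italian salad dressing',
    '2 Tbsp. lemon juice',
    '1 Tbsp. water',
    '1/2 tsp. dried mint, crushed',
    '1/4 c. crumbled Feta cheese or 2 Tbsp. crumbled Blue cheese',
]

# A letter, then a run without '.' or ',', that must end at a comma
# or at the end of the string. The flag is passed separately.
NAME = re.compile(r'[a-z][^.,]*(?![^,])', re.IGNORECASE)

def ingredient_name(line):
    """Return the first matching run, or None if nothing matches."""
    m = NAME.search(line)
    return m.group().strip() if m else None

names = [ingredient_name(ing) for ing in ingredients]
for name in names:
    print(name)
```

Abbreviations such as "c." or "Tbsp." are skipped because any run starting inside them ends at a period, which the lookahead rejects. The same reasoning means a unit written without a period (for example "cups") would not be skipped.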
I'd recommend the following regex to find and replace the substrings you are not interested in. Because the units of measurement are spelled out in the pattern, it also handles units that are not abbreviated.
\s*(?:(?:(?:[0-9]\s*)?[0-9]+\/)?[0-9]+\s*(?:(?:c\.|cups?|tsp\.|teaspoon|tbsp\.|tablespoon)\s*)?)|,.*|.*\bor\b
Replace with: nothing
Live demo
This shows how the pattern matches:
https://regex101.com/r/qV5iR8/3
Sample string
Note that the last line has two ingredients separated by "or"; according to the OP, the first one should be eliminated.
1 medium tomato, cut into 8 wedges
4 c. torn mixed salad greens
1/2 small red onion, sliced and separated into rings
1/4 small cucumber, sliced
1 1/4 c. sliced pitted ripe olives
2 Tbsp. reduced-calorie Italian salad dressing
2 Tbsp. lemon juice
1 Tbsp. water
1/2 tsp. dried mint, crushed
1/4 c. crumbled Feta cheese or 2 Tbsp. crumbled Blue cheese
After Replacement
medium tomato
torn mixed salad greens
small red onion
small cucumber
sliced pitted ripe olives
reduced-calorie Italian salad dressing
lemon juice
water
dried mint
crumbled Blue cheese
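In Python, the find-and-replace above could be sketched with `re.sub`. The names `STRIP` and `clean` are hypothetical, and the sketch assumes case-insensitive matching so that "Tbsp." is caught by the `tbsp\.` alternative.

```python
import re

# Same pattern as above, split across lines for readability.
# Assumption: re.IGNORECASE, so "Tbsp." matches the "tbsp\." alternative.
STRIP = re.compile(
    r'\s*(?:(?:(?:[0-9]\s*)?[0-9]+\/)?[0-9]+\s*'
    r'(?:(?:c\.|cups?|tsp\.|teaspoon|tbsp\.|tablespoon)\s*)?)'
    r'|,.*|.*\bor\b',
    re.IGNORECASE,
)

def clean(line):
    """Delete quantities, units, trailing directions, and any 'X or' prefix."""
    return STRIP.sub('', line).strip()

samples = [
    '1 medium tomato, cut into 8 wedges',
    '4 c. torn mixed salad greens',
    '1/2 small red onion, sliced and separated into rings',
    '1/4 small cucumber, sliced',
    '1 1/4 c. sliced pitted ripe olives',
    '2 Tbsp. reduced-calorie Italian salad dressing',
    '2 Tbsp. lemon juice',
    '1 Tbsp. water',
    '1/2 tsp. dried mint, crushed',
    '1/4 c. crumbled Feta cheese or 2 Tbsp. crumbled Blue cheese',
]

cleaned = [clean(s) for s in samples]
for item in cleaned:
    print(item)
```

Each line is processed on its own here, so the `.*` in the last two alternatives never runs past the current ingredient.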
NODE                    EXPLANATION
----------------------------------------------------------------------
\s*                     whitespace (\n, \r, \t, \f, and " ") (0 or more
                        times (matching the most amount possible))
(?:                     group, but do not capture:
  (?:                   group, but do not capture (optional
                        (matching the most amount possible)):
    (?:                 group, but do not capture (optional
                        (matching the most amount possible)):
      [0-9]             any character of: '0' to '9'
      \s*               whitespace (\n, \r, \t, \f, and " ") (0 or more
                        times (matching the most amount possible))
    )?                  end of grouping
    [0-9]+              any character of: '0' to '9' (1 or more times
                        (matching the most amount possible))
    \/                  '/'
  )?                    end of grouping
  [0-9]+                any character of: '0' to '9' (1 or more times
                        (matching the most amount possible))
  \s*                   whitespace (\n, \r, \t, \f, and " ") (0 or more
                        times (matching the most amount possible))
  (?:                   group, but do not capture (optional
                        (matching the most amount possible)):
    (?:                 group, but do not capture:
      c                 'c'
      \.                '.'
     |                  OR
      cup               'cup'
      s?                's' (optional (matching the most amount possible))
     |                  OR
      tsp               'tsp'
      \.                '.'
     |                  OR
      teaspoon          'teaspoon'
     |                  OR
      tbsp              'tbsp'
      \.                '.'
     |                  OR
      tablespoon        'tablespoon'
    )                   end of grouping
    \s*                 whitespace (\n, \r, \t, \f, and " ") (0 or more
                        times (matching the most amount possible))
  )?                    end of grouping
)                       end of grouping
|                       OR
,                       ','
.*                      any character except \n (0 or more times
                        (matching the most amount possible))
|                       OR
.*                      any character except \n (0 or more times
                        (matching the most amount possible))
\b                      the boundary between a word char (\w) and
                        something that is not a word char
or                      'or'
\b                      the boundary between a word char (\w) and
                        something that is not a word char
----------------------------------------------------------------------