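Python Regex Help - Greedy
I am trying to write a regular expression that selects a range/paragraph of text. The regex I have come up with so far selects too much, and I can't figure out how to fix it. The regex I am using is: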
SALINAS.+[\s\S]+(^---LETTUCE)+?
I am trying to pull out:
SALINAS-WATSONVILLE CALIFORNIA
Sales F.O.B. Shipping Point and/or Delivered Sales, Shipping Point Basis
VEGETABLES
2022 Season
---BROCCOLI: DEMAND FAIRLY LIGHT. MARKET SLIGHTLY LOWER. Extra services
included. Wide range in quality and condition. cartons bchd 14s 9.55-13.55
mostly 10.00-11.75 few 14.45-14.95 occasional lower bchd 18s 10.05-14.05 mostly
10.50-12.25 few 14.95-15.45 occasional lower 20 lb cartons loose Crown Cut
10.00-14.85 mostly 11.00-12.75 few 15.50-15.95 Short Trim 12.00-15.85 mostly
12.00-13.75 few 16.50-16.95 ORGANIC cartons bchd 14s 14.00-22.50 mostly
16.55-18.75 few 24.95 occasional higher 20 lb cartons loose Crown Cut
18.00-24.75 mostly 20.55-22.75 few 28.50-28.95
---CAULIFLOWER: SUPPLY FAIRLY HEAVY. DEMAND FAIRLY LIGHT. MARKET ABOUT STEADY.
Extra services included. Harvest curtailed by market conditions. cartons film
wrapped White 9s 8.55-11.75 mostly 8.65-10.00 few 12.55 occasional higher and
lower 12s 8.45-12.75 mostly 9.45-10.65 few 13.50-13.55 occasional higher 16s
7.55-11.55 mostly 8.45-9.65 few 12.55 ORGANIC cartons film wrapped White 9s
9.00-16.50 mostly 13.55-15.50 one label 18.95 12s 12.00-17.50 mostly
14.50-15.85 few 18.95 16s 12.00-16.50 mostly 12.00-14.50 one label 18.95
---CELERY: DEMAND MODERATE. MARKET ABOUT STEADY. Extra services included. Wide
range in quality and condition. cartons 2 dz 12.00-16.75 mostly 13.35-15.00 few
17.50-18.55 occasional lower 2 1/2 dz 12.00-16.75 mostly 14.06-15.50 few
17.50-18.55 occasional lower 3 dz 14.06-17.55 mostly 14.06-16.65 one label
18.45 cartons film bags Hearts 18s 17.06-20.55 mostly 17.50-19.06 few
21.55-22.55 ORGANIC cartons 2 dz 14.00-17.50 mostly 14.50-16.75 few 18.50 one
label 20.95 2 1/2 dz 14.50-17.50 mostly 14.50-16.75 few 18.50-18.56 one label
20.95 cartons film bags Hearts 18s 14.50-18.95 mostly 16.00-17.85 occasional
higher
---LETTUCE-ICEBERG: DEMAND FAIRLY GOOD. MARKET SLIGHTLY HIGHER. Extra services
included. Wide range in quality and condition. cartons flm lined 24s
15.55-18.75 mostly 15.55-17.50 few 19.00-19.65 few 12.00-12.95 24s flmwrpd
16.55-19.75 mostly 16.55-18.50 few 20.00-20.65 few 13.00-13.95 30s flmwrpd
14.00-15.55 mostly 14.00-14.75 occasional higher ORGANIC cartons 24s flmwrpd
16.00-20.50 mostly 16.00-18.50 12s flmwrpd 10.00-12.95 mostly 10.00-11.85
---LETTUCE-OTHER: DEMAND FAIRLY LIGHT. MARKET ABOUT STEADY. Extra services
included. Wide range in quality and condition. cartons Boston 24s 10.50-14.75
mostly 10.50-12.55 few 15.50-15.75 Green Leaf 24s 8.56-11.95 mostly 9.50-10.65
few 12.06-12.50 one label 15.75 Red Leaf 24s 8.56-11.75 mostly 9.50-10.65 few
12.05-12.75 ORGANIC cartons Green Leaf 24s 12.00-16.75 mostly 12.00-14.75 Red
Leaf 24s 12.00-16.75 mostly 12.00-14.75
---LETTUCE-ROMAINE: DEMAND HEARTS FAIRLY LIGHT, 24S MODERATE. MARKET ABOUT
STEADY. Extra services included. Wide range in quality and condition. cartons
24s 10.00-13.75 mostly 10.15-11.95 occasional higher cartons 12 3-count
packages Hearts 12.85-17.95 mostly 14.50-16.75 few 18.05-18.75 occasional
higher cartons film lined Hearts 48s 13.85-19.95 mostly 16.50-18.75 few
20.00-20.75 ORGANIC cartons 24s 16.50-18.55 few 20.00-20.75 cartons 12 3-count
packages Hearts 13.85-20.50 mostly 17.55-18.95 few 22.50-22.75
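from the text file found at: https://www.ams.usda.gov/mnreports/ix_fv120.txt
If you put a ? after a quantifier such as .* it becomes lazy (instead of greedy), meaning it matches as little as possible. The regex below (from Regex101) matches only that paragraph. You can modify it to match the vegetable market listing for any other city, or anything else.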
SALINAS-WATSONVILLE CALIFORNIA(.|\n)*?MARKET(.|\n)*?\n\n (flags: gm)
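The lazy quantifier is what keeps the pattern above from running on to the end of the file. A minimal Python sketch of the difference, using a made-up string rather than the actual report:

```python
import re

# Made-up sample with two possible end markers.
text = "START a END b END"

# Greedy: .* grabs as much as possible, so the match runs to the LAST "END".
greedy = re.search(r"START.*END", text).group()

# Lazy: .*? grabs as little as possible, so the match stops at the FIRST "END".
lazy = re.search(r"START.*?END", text).group()

print(greedy)  # START a END b END
print(lazy)    # START a END
```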
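Just change the name (SALINAS-WATSONVILLE CALIFORNIA) to select a different city's market. Also note that this match includes the two newlines at the end (and a lot of trailing whitespace). To leave the newlines out, create a group just before the final \n\n and select that group (group 1). See the Regex101 link.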
(SALINAS-WATSONVILLE CALIFORNIA(.|\n)*?MARKET(.|\n)*?)\n\n (flags: gm)
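Since the question is about Python, here is a rough sketch of how the grouped pattern might be applied with the re module. The sample text is a shortened, made-up stand-in for the report (OTHER-CITY STATE is a hypothetical heading), and the variable names are only for illustration. The g flag has no direct Python equivalent (re.search returns the first match), and the m flag only changes how ^ and $ behave, which this pattern does not use, so no flags are passed.

```python
import re

# Shortened, made-up stand-in for the report; the real file is much longer.
sample = (
    "SALINAS-WATSONVILLE CALIFORNIA\n"
    "Sales F.O.B. Shipping Point and/or Delivered Sales, Shipping Point Basis\n"
    "---BROCCOLI: DEMAND FAIRLY LIGHT. MARKET SLIGHTLY LOWER.\n"
    "---CELERY: DEMAND MODERATE. MARKET ABOUT STEADY.\n"
    "\n"
    "OTHER-CITY STATE\n"
    "---CELERY: DEMAND MODERATE. MARKET ABOUT STEADY.\n"
    "\n"
)

# Same pattern as above: group 1 is the section without the trailing blank line.
pattern = re.compile(r"(SALINAS-WATSONVILLE CALIFORNIA(.|\n)*?MARKET(.|\n)*?)\n\n")

match = pattern.search(sample)
section = match.group(1) if match else None
print(section)
```

With the real file, you would read its contents into a string (for example with open(...).read()) and pass that to pattern.search instead of sample.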
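Regular expressions are great for matching patterns, but since you are probably using Python (I infer that from the title) and working with many lines at once, I would go for a simpler approach.
You can try the following code: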
# This is the location for which you want to extract the paragraph
location_to_find = 'SALINAS'
# Read all lines into a list. Each line ends with a '\n'
with open('ix_fv120.txt') as fr:
    lines = fr.readlines()
# Look for all occurrences of "Sales F.O.B." and get their location (i.e. line number). Because
# paragraphs start 2 lines before each occurrence of "Sales F.O.B.", we collect values of "n-2".
paragraph_poss = [(n-2) for n, line in enumerate(lines) if line.startswith('Sales F.O.B.')]
# Now search only in lines having a location name to see which one of them is for the location
# you are looking for, e.g. "SALINAS".
for n, cur_paragraph_line in enumerate(paragraph_poss):
    if location_to_find in lines[cur_paragraph_line]:
        if n == len(paragraph_poss)-1:
            # Last paragraph in the file: take everything up to the end
            paragraph = ''.join(lines[paragraph_poss[n]:])
        else:
            # Otherwise stop where the next paragraph starts
            paragraph = ''.join(lines[paragraph_poss[n]:paragraph_poss[n+1]])
        print(paragraph)
        break
else:
    # for/else: this runs only if the loop finished without hitting "break"
    print('Error: Could not find "' + location_to_find + '" at the beginning of any paragraph.')