
Python Regex Help - Greedy

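I'm trying to write a regex that selects a range/paragraph of text. The regex I've come up with so far selects too much, and I can't figure out how to fix it. The regex I'm using is: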

SALINAS.+[\s\S]+(^---LETTUCE)+?

I'm trying to pull out:

SALINAS-WATSONVILLE CALIFORNIA                                                 
                                                                               
Sales F.O.B. Shipping Point and/or Delivered Sales, Shipping Point Basis 

VEGETABLES                                                                     
2022 Season                                                                    
---BROCCOLI: DEMAND FAIRLY LIGHT. MARKET SLIGHTLY LOWER. Extra services        
included. Wide range in quality and condition. cartons bchd 14s 9.55-13.55     
mostly 10.00-11.75 few 14.45-14.95 occasional lower bchd 18s 10.05-14.05 mostly
10.50-12.25 few 14.95-15.45 occasional lower 20 lb cartons loose Crown Cut     
10.00-14.85 mostly 11.00-12.75 few 15.50-15.95 Short Trim 12.00-15.85 mostly   
12.00-13.75 few 16.50-16.95 ORGANIC cartons bchd 14s 14.00-22.50 mostly        
16.55-18.75 few 24.95 occasional higher 20 lb cartons loose Crown Cut          
18.00-24.75 mostly 20.55-22.75 few 28.50-28.95                                 
---CAULIFLOWER: SUPPLY FAIRLY HEAVY. DEMAND FAIRLY LIGHT. MARKET ABOUT STEADY. 
Extra services included. Harvest curtailed by market conditions. cartons film  
wrapped White   9s 8.55-11.75 mostly 8.65-10.00 few 12.55 occasional higher and
lower 12s 8.45-12.75 mostly 9.45-10.65 few 13.50-13.55 occasional higher 16s   
7.55-11.55 mostly 8.45-9.65 few 12.55 ORGANIC cartons film wrapped White 9s    
9.00-16.50 mostly 13.55-15.50 one label 18.95 12s 12.00-17.50 mostly           
14.50-15.85 few 18.95 16s 12.00-16.50 mostly 12.00-14.50 one label 18.95       
---CELERY: DEMAND MODERATE. MARKET ABOUT STEADY. Extra services included. Wide 
range in quality and condition. cartons 2 dz 12.00-16.75 mostly 13.35-15.00 few
17.50-18.55 occasional lower 2 1/2 dz 12.00-16.75 mostly 14.06-15.50 few       
17.50-18.55 occasional lower 3 dz 14.06-17.55 mostly 14.06-16.65 one label     
18.45 cartons film bags Hearts 18s 17.06-20.55 mostly 17.50-19.06 few          
21.55-22.55 ORGANIC cartons 2 dz 14.00-17.50 mostly 14.50-16.75 few 18.50 one  
label 20.95 2 1/2 dz 14.50-17.50 mostly 14.50-16.75 few 18.50-18.56 one label  
20.95 cartons film bags Hearts 18s 14.50-18.95 mostly 16.00-17.85 occasional   
higher                                                                         
---LETTUCE-ICEBERG: DEMAND FAIRLY GOOD. MARKET SLIGHTLY HIGHER. Extra services 
included. Wide range in quality and condition. cartons flm lined 24s           
15.55-18.75 mostly 15.55-17.50 few 19.00-19.65 few 12.00-12.95 24s flmwrpd     
16.55-19.75 mostly 16.55-18.50 few 20.00-20.65 few 13.00-13.95 30s flmwrpd     
14.00-15.55 mostly 14.00-14.75 occasional higher ORGANIC cartons 24s flmwrpd   
16.00-20.50 mostly 16.00-18.50 12s flmwrpd 10.00-12.95 mostly 10.00-11.85      
---LETTUCE-OTHER: DEMAND FAIRLY LIGHT. MARKET ABOUT STEADY. Extra services     
included.  Wide range in quality and condition. cartons Boston 24s 10.50-14.75 
mostly 10.50-12.55 few 15.50-15.75 Green Leaf 24s 8.56-11.95 mostly 9.50-10.65 
few 12.06-12.50 one label 15.75 Red Leaf 24s 8.56-11.75 mostly 9.50-10.65 few  
12.05-12.75 ORGANIC cartons Green Leaf 24s 12.00-16.75 mostly 12.00-14.75 Red  
Leaf 24s 12.00-16.75 mostly 12.00-14.75                                        
---LETTUCE-ROMAINE: DEMAND HEARTS FAIRLY LIGHT, 24S MODERATE. MARKET ABOUT     
STEADY. Extra services included. Wide range in quality and condition. cartons  
24s 10.00-13.75 mostly 10.15-11.95 occasional higher cartons 12 3-count        
packages Hearts 12.85-17.95 mostly 14.50-16.75 few 18.05-18.75 occasional      
higher cartons film lined Hearts 48s 13.85-19.95 mostly 16.50-18.75 few        
20.00-20.75 ORGANIC cartons 24s 16.50-18.55  few 20.00-20.75 cartons 12 3-count
packages Hearts 13.85-20.50 mostly 17.55-18.95 few 22.50-22.75 

from the text file found at: https://www.ams.usda.gov/mnreports/ix_fv120.txt

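If you put a ? after a quantifier in a search expression such as .*, it becomes lazy (instead of greedy). The regex below (from Regex101) matches only the paragraph. You can modify it to match the vegetable market report for any other city, or anything else.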

SALINAS-WATSONVILLE CALIFORNIA(.|\n)*?MARKET(.|\n)*?\n\n (flags: gm)
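To see the greedy/lazy difference in isolation, here is a minimal Python sketch. The `text` string is a made-up miniature, not the real report; it only shows that a greedy quantifier runs to the last possible terminator while the lazy form stops at the first one.

```python
import re

# Hypothetical miniature of the report with two "---LETTUCE" markers.
text = "SALINAS a ---LETTUCE b ---LETTUCE c"

# Greedy: .+ consumes as much as possible, then backtracks to the LAST marker.
greedy = re.search(r"SALINAS.+---LETTUCE", text).group(0)

# Lazy: .+? consumes as little as possible and stops at the FIRST marker.
lazy = re.search(r"SALINAS.+?---LETTUCE", text).group(0)

print(greedy)  # SALINAS a ---LETTUCE b ---LETTUCE
print(lazy)    # SALINAS a ---LETTUCE
```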

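Just change the name (SALINAS-WATSONVILLE CALIFORNIA) to select a different city's market. Also note that the match includes the two newlines at the end (and a lot of trailing whitespace). To drop the newlines, wrap everything before the final \n\n in a group and select that group (group 1). See the Regex101 link.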

(SALINAS-WATSONVILLE CALIFORNIA(.|\n)*?MARKET(.|\n)*?)\n\n (flags: gm)
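In Python, the grouped pattern could be applied roughly as follows. This is a sketch under assumptions: `extract_market` and `sample` are hypothetical names, `sample` is a shortened stand-in for the real report, and `[\s\S]` is swapped in for `(.|\n)` so that the only capture group is group 1. Python has no `g` flag; `re.search` simply returns the first match, and the `m` flag is not needed because the pattern uses no `^` or `$` anchors.

```python
import re

# Hypothetical, shortened stand-in for the contents of ix_fv120.txt.
sample = (
    "OTHER CITY\n"
    "---BROCCOLI: MARKET STEADY.\n"
    "\n"
    "SALINAS-WATSONVILLE CALIFORNIA\n"
    "Sales F.O.B. Shipping Point\n"
    "\n"
    "VEGETABLES\n"
    "---BROCCOLI: MARKET SLIGHTLY LOWER.\n"
    "---LETTUCE-ROMAINE: MARKET ABOUT STEADY.\n"
    "\n"
    "NEXT CITY\n"
    "---CELERY: MARKET STEADY.\n"
)

def extract_market(text, city):
    """Return the paragraph for `city` without the trailing blank line, or None."""
    # re.escape keeps characters such as "-" in the city name literal.
    pattern = "(" + re.escape(city) + r"[\s\S]*?MARKET[\s\S]*?)\n\n"
    match = re.search(pattern, text)
    return match.group(1) if match else None

paragraph = extract_market(sample, "SALINAS-WATSONVILLE CALIFORNIA")
print(paragraph)
```

Requiring the literal MARKET before the closing `\n\n` is what lets the match run past the blank line that follows the "Sales F.O.B." header.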

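Regexes are great for matching patterns, but since you are probably using Python (I infer this from the title) and processing many lines at once, I would go with a simpler approach.

You can try the following code: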

# This is the location for which you want to extract the paragraph
location_to_find = 'SALINAS'

# Read all lines into a list. Each line ends with a '\n'
with open('ix_fv120.txt') as fr:
  lines = fr.readlines()

# Look for all occurrences of "Sales F.O.B." and get their location (i.e. line number). Because
# paragraphs start 2 lines before each occurrence of "Sales F.O.B.", we collect values of "n-2".
paragraph_poss = [(n-2)  for n,line in enumerate(lines)  if line.startswith('Sales F.O.B.')]

# Now search only in lines having a location name to see which one of them is for the location
# you are looking for, e.g. "SALINAS".
for n,cur_paragraph_line in enumerate(paragraph_poss):
  if location_to_find in lines[cur_paragraph_line]:
    if n == len(paragraph_poss)-1:
      paragraph = ''.join(lines[paragraph_poss[n]:])
    else:
      paragraph = ''.join(lines[paragraph_poss[n]:paragraph_poss[n+1]])
    print(paragraph)
    break
else:
  print('Error: Could not find "' + location_to_find + '" at the beginning of any paragraph.')
