Python Regex Help - Greedy

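I am trying to write a regex that selects a range/paragraph of text. The regex I have come up with so far selects too much, and I can't figure out how to fix it. The regex I am using is: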

SALINAS.+[\s\S]+(^---LETTUCE)+?

I am trying to pull out:

SALINAS-WATSONVILLE CALIFORNIA                                                 
                                                                               
Sales F.O.B. Shipping Point and/or Delivered Sales, Shipping Point Basis 

VEGETABLES                                                                     
2022 Season                                                                    
---BROCCOLI: DEMAND FAIRLY LIGHT. MARKET SLIGHTLY LOWER. Extra services        
included. Wide range in quality and condition. cartons bchd 14s 9.55-13.55     
mostly 10.00-11.75 few 14.45-14.95 occasional lower bchd 18s 10.05-14.05 mostly
10.50-12.25 few 14.95-15.45 occasional lower 20 lb cartons loose Crown Cut     
10.00-14.85 mostly 11.00-12.75 few 15.50-15.95 Short Trim 12.00-15.85 mostly   
12.00-13.75 few 16.50-16.95 ORGANIC cartons bchd 14s 14.00-22.50 mostly        
16.55-18.75 few 24.95 occasional higher 20 lb cartons loose Crown Cut          
18.00-24.75 mostly 20.55-22.75 few 28.50-28.95                                 
---CAULIFLOWER: SUPPLY FAIRLY HEAVY. DEMAND FAIRLY LIGHT. MARKET ABOUT STEADY. 
Extra services included. Harvest curtailed by market conditions. cartons film  
wrapped White   9s 8.55-11.75 mostly 8.65-10.00 few 12.55 occasional higher and
lower 12s 8.45-12.75 mostly 9.45-10.65 few 13.50-13.55 occasional higher 16s   
7.55-11.55 mostly 8.45-9.65 few 12.55 ORGANIC cartons film wrapped White 9s    
9.00-16.50 mostly 13.55-15.50 one label 18.95 12s 12.00-17.50 mostly           
14.50-15.85 few 18.95 16s 12.00-16.50 mostly 12.00-14.50 one label 18.95       
---CELERY: DEMAND MODERATE. MARKET ABOUT STEADY. Extra services included. Wide 
range in quality and condition. cartons 2 dz 12.00-16.75 mostly 13.35-15.00 few
17.50-18.55 occasional lower 2 1/2 dz 12.00-16.75 mostly 14.06-15.50 few       
17.50-18.55 occasional lower 3 dz 14.06-17.55 mostly 14.06-16.65 one label     
18.45 cartons film bags Hearts 18s 17.06-20.55 mostly 17.50-19.06 few          
21.55-22.55 ORGANIC cartons 2 dz 14.00-17.50 mostly 14.50-16.75 few 18.50 one  
label 20.95 2 1/2 dz 14.50-17.50 mostly 14.50-16.75 few 18.50-18.56 one label  
20.95 cartons film bags Hearts 18s 14.50-18.95 mostly 16.00-17.85 occasional   
higher                                                                         
---LETTUCE-ICEBERG: DEMAND FAIRLY GOOD. MARKET SLIGHTLY HIGHER. Extra services 
included. Wide range in quality and condition. cartons flm lined 24s           
15.55-18.75 mostly 15.55-17.50 few 19.00-19.65 few 12.00-12.95 24s flmwrpd     
16.55-19.75 mostly 16.55-18.50 few 20.00-20.65 few 13.00-13.95 30s flmwrpd     
14.00-15.55 mostly 14.00-14.75 occasional higher ORGANIC cartons 24s flmwrpd   
16.00-20.50 mostly 16.00-18.50 12s flmwrpd 10.00-12.95 mostly 10.00-11.85      
---LETTUCE-OTHER: DEMAND FAIRLY LIGHT. MARKET ABOUT STEADY. Extra services     
included.  Wide range in quality and condition. cartons Boston 24s 10.50-14.75 
mostly 10.50-12.55 few 15.50-15.75 Green Leaf 24s 8.56-11.95 mostly 9.50-10.65 
few 12.06-12.50 one label 15.75 Red Leaf 24s 8.56-11.75 mostly 9.50-10.65 few  
12.05-12.75 ORGANIC cartons Green Leaf 24s 12.00-16.75 mostly 12.00-14.75 Red  
Leaf 24s 12.00-16.75 mostly 12.00-14.75                                        
---LETTUCE-ROMAINE: DEMAND HEARTS FAIRLY LIGHT, 24S MODERATE. MARKET ABOUT     
STEADY. Extra services included. Wide range in quality and condition. cartons  
24s 10.00-13.75 mostly 10.15-11.95 occasional higher cartons 12 3-count        
packages Hearts 12.85-17.95 mostly 14.50-16.75 few 18.05-18.75 occasional      
higher cartons film lined Hearts 48s 13.85-19.95 mostly 16.50-18.75 few        
20.00-20.75 ORGANIC cartons 24s 16.50-18.55  few 20.00-20.75 cartons 12 3-count
packages Hearts 13.85-20.50 mostly 17.55-18.95 few 22.50-22.75 

from the text file found at: https://www.ams.usda.gov/mnreports/ix_fv120.txt

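If you put a ? after a quantifier such as .*, the quantifier becomes lazy (rather than greedy) and matches as little as possible. The regex below (from Regex101) matches only the paragraph. You can modify it to match the vegetable market report for any other city, or anything else.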

SALINAS-WATSONVILLE CALIFORNIA(.|\n)*?MARKET(.|\n)*?\n\n (flags: gm)
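
To see the difference between a greedy and a lazy quantifier in Python, here is a minimal sketch on a made-up string. The text and the ---X marker are illustrative assumptions, not taken from the report.

```python
import re

# Made-up text with two occurrences of the end marker "---X".
text = "A: one\n---X: first\n---X: second\n"

# Greedy: [\s\S]+ grabs as much as it can, so the match runs to the LAST "---X".
greedy = re.search(r"A:[\s\S]+---X", text).group(0)

# Lazy: [\s\S]+? grabs as little as it can, so the match stops at the FIRST "---X".
lazy = re.search(r"A:[\s\S]+?---X", text).group(0)

print(repr(greedy))  # 'A: one\n---X: first\n---X'
print(repr(lazy))    # 'A: one\n---X'
```

The same effect is likely what makes a greedy [\s\S]+ run far past the intended section when the end marker occurs many times in the file.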

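Just change the name (SALINAS-WATSONVILLE CALIFORNIA) to select a different city's market. Also note that this match includes the two newlines at the end (along with a lot of trailing whitespace). To leave out the newlines, put a group around everything before the final \n\n and select that group (group 1). See the Regex101 link.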

(SALINAS-WATSONVILLE CALIFORNIA(.|\n)*?MARKET(.|\n)*?)\n\n (flags: gm)
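
As a rough sketch of how the grouped pattern could be used from Python: the helper name extract_market and the abbreviated sample text are assumptions made for illustration, and the inner groups are written as non-capturing (?:...) so that group 1 is the whole paragraph. Regex101's g flag has no Python equivalent (re.search returns the first match), and the m flag is not needed because the pattern has no ^ or $.

```python
import re

def extract_market(text, city):
    """Return the report paragraph for `city`, or None if it is not found."""
    # Same shape as the pattern above: city name, lazily up to the first
    # "MARKET", then lazily up to the first empty line ("\n\n").
    pattern = "(" + re.escape(city) + r"(?:.|\n)*?MARKET(?:.|\n)*?)\n\n"
    match = re.search(pattern, text)
    return match.group(1) if match else None

# Abbreviated, made-up sample that mimics the report layout.
sample = (
    "SALINAS-WATSONVILLE CALIFORNIA   \n"
    "   \n"
    "Sales F.O.B. Shipping Point\n"
    "\n"
    "VEGETABLES\n"
    "---BROCCOLI: DEMAND LIGHT. MARKET LOWER. cartons 9.55-13.55\n"
    "---LETTUCE-ROMAINE: MARKET STEADY. cartons 10.00-13.75\n"
    "\n"
    "OTHER-CITY EXAMPLE\n"
    "   \n"
    "Sales F.O.B. Shipping Point\n"
    "\n"
    "VEGETABLES\n"
    "---LETTUCE-ICEBERG: MARKET HIGHER. cartons 15.55-18.75\n"
    "\n"
)

result = extract_market(sample, "SALINAS-WATSONVILLE CALIFORNIA")
print(result)
```

The MARKET anchor matters here: the empty line right after the "Sales F.O.B." line would otherwise end the lazy match before the commodity lines begin.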

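Regexes are great for matching patterns, but since you are probably using Python (as I infer from the title) and processing many lines at once, I would choose a simpler approach.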

You can try the following code:

# This is the location for which you want to extract the paragraph
location_to_find = 'SALINAS'

# Read all lines into a list. Each line ends with a '\n'
with open('ix_fv120.txt') as fr:
  lines = fr.readlines()

# Look for all occurrences of "Sales F.O.B." and get their location (i.e. line number). Because
# paragraphs start 2 lines before each occurrence of "Sales F.O.B.", we collect values of "n-2".
paragraph_poss = [(n-2)  for n,line in enumerate(lines)  if line.startswith('Sales F.O.B.')]

# Now search only in lines having a location name to see which one of them is for the location
# you are looking for, e.g. "SALINAS".
for n,cur_paragraph_line in enumerate(paragraph_poss):
  if location_to_find in lines[cur_paragraph_line]:
    if n == len(paragraph_poss)-1:
      paragraph = ''.join(lines[paragraph_poss[n]:])
    else:
      paragraph = ''.join(lines[paragraph_poss[n]:paragraph_poss[n+1]])
    print(paragraph)
    break
else:
  print('Error: Could not find "' + location_to_find + '" at the beginning of any paragraph.')
