Identifying patterns in strings using regex in Python

This is the code I am using:

import re

sample = 'The 17 services industries reporting growth in February — listed in order — are: Accommodation & Food Services; Wholesale Trade; Transportation & Warehousing; Construction; Arts, Entertainment & Recreation; Public Administration; Utilities; Health Care & Social Assistance; Retail Trade; Professional, Scientific & Technical Services; Finance & Insurance; Management of Companies & Support Services; Information; Agriculture, Forestry, Fishing & Hunting; Educational Services; Other Services; and Mining. The only industry reporting contraction in February is Real Estate, Rental & Leasing.'
#Find the growth industries
growth_pattern = r'growth.*?:(.*?)\.'
growths = re.findall(growth_pattern,sample)
growths = growths[0].strip().split(';') if len(growths) == 1 else []

#Find the no change industries
nochange_pattern = r'no change.*?:(.*?)\.'
nochanges = re.findall(nochange_pattern,sample)
nochanges = nochanges[0].strip().split(';') if len(nochanges) == 1 else []

#Find the contraction industries
contraction_pattern = r'contraction.*?:(.*?)\.'
contractions = re.findall(contraction_pattern,sample)
contractions = contractions[0].strip().split(';') if len(contractions) == 1 else []

#Find the decrease industries
decrease_pattern = r'decrease.*?:(.*?)\.'
decreases = re.findall(decrease_pattern,sample)
decreases = decreases[0].strip().split(';') if len(decreases) == 1 else []

#Give numbers to each of the industries
growths = [(g.strip().replace('and ',''),len(growths)-i) for i,g in enumerate(growths)]
nochanges = [(nc.strip().replace('and ',''),0) for i,nc in enumerate(nochanges)]
contractions = [(c.strip().replace('and ',''),-(len(contractions)-i)) for i,c in enumerate(contractions)]
decreases = [(c.strip().replace('and ',''),-(len(decreases)-i)) for i,c in enumerate(decreases)]

#Print them out to check (commented out for now)
#print('growths:'+str(growths))
#print('nochanges:'+str(nochanges))
#print('contractions:'+str(contractions))

#Combine them all together, sort by value, and print out
all_together = growths+nochanges+contractions+decreases
all_together_ser = sorted(all_together,key=lambda x: -x[1])
print(all_together_ser)

With this growth_pattern, I can capture everything between the ':' and the '.'. This works.

growth_pattern = r'growth.*?:(.*?)\.'

For contraction_pattern, however, the sentence uses 'is' instead of 'are:' because it is singular, so there is no ':' for the pattern to match.

How can I check for both the 'is' and the ':' form in this regex, as an either/or condition?

The correct output should be:

[('Accommodation & Food Services', 17), ('Wholesale Trade', 16), ('Transportation & Warehousing', 15), ('Construction', 14), ('Arts, Entertainment & Recreation', 13), ('Public Administration', 12), ('Utilities', 11), ('Health Care & Social Assistance', 10), ('Retail Trade', 9), ('Professional, Scientific & Technical Services', 8), ('Finance & Insurance', 7), ('Management of Companies & Support Services', 6), ('Information', 5), ('Agriculture, Forestry, Fishing & Hunting', 4), ('Educational Services', 3), ('Other Services', 2), ('Mining', 1),('Real Estate, Rental & Leasing',-1)]

But the output we get is:

[('Accommodation & Food Services', 17), ('Wholesale Trade', 16), ('Transportation & Warehousing', 15), ('Construction', 14), ('Arts, Entertainment & Recreation', 13), ('Public Administration', 12), ('Utilities', 11), ('Health Care & Social Assistance', 10), ('Retail Trade', 9), ('Professional, Scientific & Technical Services', 8), ('Finance & Insurance', 7), ('Management of Companies & Support Services', 6), ('Information', 5), ('Agriculture, Forestry, Fishing & Hunting', 4), ('Educational Services', 3), ('Other Services', 2), ('Mining', 1)]

The last tuple is not picked up from the input text.

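I changed your code to cut down the number of steps. The main thing you were missing is the regex "or" operator, |. With it, you can find every industry type in one go, whether the sentence is singular or plural. The matches are then stored in a dictionary keyed by type, so the numbers can be assigned afterwards. Let me know if this helps or if you need further clarification!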
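As a small illustration of the alternation before the full code, here is a sketch on two made-up sentences (they are assumptions for the demo, not part of your sample). The group `(?::| is )` is non-capturing and matches either ':' or ' is ', so the same pattern handles both the plural and the singular form.

```python
import re

# Two made-up sentences: one plural (uses "are:"), one singular (uses " is ").
plural = 'Industries reporting growth are: Utilities; Mining.'
singular = 'The only industry reporting contraction is Real Estate.'

# (?::| is ) matches either ':' or ' is ' without adding a capture group.
pattern = r'(growth|contraction).*?(?::| is )(.*?)\.'

plural_match = re.findall(pattern, plural)
singular_match = re.findall(pattern, singular)
print(plural_match)    # [('growth', ' Utilities; Mining')]
print(singular_match)  # [('contraction', 'Real Estate')]
```

Each match is a (keyword, list text) pair, which is what makes it easy to build a dictionary keyed by category.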


import re

sample = 'The 17 services industries reporting growth in February — listed in order — are: Accommodation & Food Services; Wholesale Trade; Transportation & Warehousing; Construction; Arts, Entertainment & Recreation; Public Administration; Utilities; Health Care & Social Assistance; Retail Trade; Professional, Scientific & Technical Services; Finance & Insurance; Management of Companies & Support Services; Information; Agriculture, Forestry, Fishing & Hunting; Educational Services; Other Services; and Mining. The only industry reporting contraction in February is Real Estate, Rental & Leasing.'

# One pattern for every category: the keyword, then ':' or ' is ', then the list up to the next '.'
full_pattern = r'(growth|contraction|no change|decrease).*?(?::| is )(.*?)\.'
found = re.findall(full_pattern, sample)
# Map each category keyword to its list of industry names
dic = {kind: names.strip().split(';') for kind, names in found}

all_together = []
if 'growth' in dic:
    all_together += [(g.strip().replace('and ', ''), len(dic['growth']) - i) for i, g in enumerate(dic['growth'])]
if 'no change' in dic:
    all_together += [(g.strip().replace('and ', ''), 0) for g in dic['no change']]
# The key is 'decrease', matching the keyword used in full_pattern
if 'decrease' in dic:
    all_together += [(g.strip().replace('and ', ''), -(len(dic['decrease']) - i)) for i, g in enumerate(dic['decrease'])]
if 'contraction' in dic:
    all_together += [(g.strip().replace('and ', ''), -(len(dic['contraction']) - i)) for i, g in enumerate(dic['contraction'])]

# Sort from the highest score to the lowest
all_together_ser = sorted(all_together, key=lambda x: -x[1])
print(all_together_ser)
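One caveat with the code above: `replace('and ', '')` removes every occurrence of 'and ', not just the one in front of the last list entry. It does no harm on this sample, but an industry name that contains 'and ' would be mangled. A minimal sketch of a stricter cleanup, assuming only a leading 'and' should be dropped (`clean_name` is a hypothetical helper name, not something from your code):

```python
import re

def clean_name(raw):
    """Strip surrounding whitespace and a single leading 'and ' from one list entry."""
    return re.sub(r'^and\s+', '', raw.strip())

print(clean_name(' and Mining'))      # Mining
print(clean_name('Sand and Gravel'))  # Sand and Gravel

# For comparison, the plain replace() damages a name that contains 'and ':
print('Sand and Gravel'.replace('and ', ''))  # SGravel
```

If you want this behaviour, you could call `clean_name(g)` in place of `g.strip().replace('and ', '')` in the list comprehensions.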

