Python regex to calculate vowel/consonant ratio in English

I've embarked on a reasonably dumb linguistics project to learn regular expressions in Python. I'm pretty sure I could avoid the multiple passes over the same string and find a more "compact" and "pythonic" way to do what I'm trying to do, which is: use regex to decide whether the 'Y|y' in a word is a vowel or a consonant. In a commented-out line of the code, I've put 20 words containing 12 vowel y's and 9 consonant y's. It seems like the code could be simplified and the re.compile lines merged together.

import re
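vowelRegex = re.compile(r'[aeiouAEIOU]')
consoRegex = re.compile(r'[b-df-hj-np-tv-xzB-DF-HJ-NP-TV-XZ]')
# y between two vowels, or Y at the start of a word, counts as a consonant
yconsRegex = re.compile(r'[aeiou]y[aeiou]')
ycon2Regex = re.compile(r'\bY')
# y between two consonants, or y at the end of a word, counts as a vowel
yVowlRegex = re.compile(r'[b-df-hj-np-tv-xzB-DF-HJ-NP-TV-XZ]y[b-df-hj-np-tv-xz]')
yVow2Regex = re.compile(r'y\b')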

#thestring = 'Sky Family Yurt Germany Crypt Day New York Pennsylvania Myth Hungry Yolk Year Bayou Yak Silly Beyond Dynamite Mystery Yacht Yoda'
#thestring = 'Crypt Pennsylva Myth Dynamite Mystery'
#thestring='RoboCop eats baby food. Pennsylvania Baby Food in the bayou. And, New York is where I\'d Rather be!'
thestring='violent irrational intolerant allied to racism and ' \
    'tribalism bigotry invested in ignorance and hostile to free '\
    'inquiry contemptuous of women and coercive towards children ' \
    'organized religion ought to have a great deal on its conscience ' \
    'Yak yacht beyond mystery'
# Vowel y's: keep only the middle character (the y) of each three-letter match
fun = [f[1] for f in yVowlRegex.findall(thestring)]
fun += yVow2Regex.findall(thestring)

# Consonant y's: same idea for the vowel-y-vowel matches
nofun = [f[1] for f in yconsRegex.findall(thestring)]
nofun += ycon2Regex.findall(thestring)

print('Vowels:',''.join(fun), len(''.join(fun)))
print('Consos:',''.join(nofun), len(''.join(nofun)))

You can use the "or" (alternation) operator, |, in a regex, which would shorten it a bit. For example:

yVowlRegex = re.compile(r'[b-df-hj-np-tv-xzB-DF-HJ-NP-TV-XZ]y[b-df-hj-np-tv-xz]|y\b') 

This single pattern now covers both yVowl and yVow2.
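As a quick sanity check, the sketch below compares the merged pattern against the two original patterns on a short sample. The sample string and the variable names are made up for illustration; on other inputs the order of the matches will differ, and overlapping matches could in principle change the result.

```python
import re

# The two original patterns for a vowel-like y.
y_between_consonants = re.compile(
    r'[b-df-hj-np-tv-xzB-DF-HJ-NP-TV-XZ]y[b-df-hj-np-tv-xz]')
y_at_word_end = re.compile(r'y\b')

# The merged pattern, joining both with | (alternation).
y_vowel = re.compile(
    r'[b-df-hj-np-tv-xzB-DF-HJ-NP-TV-XZ]y[b-df-hj-np-tv-xz]|y\b')

sample = 'Sky Crypt Myth Germany Yak'  # hypothetical test string

separate = y_between_consonants.findall(sample) + y_at_word_end.findall(sample)
merged = y_vowel.findall(sample)

# Same matches either way; only the order differs.
print(sorted(separate) == sorted(merged))
print(merged)
```

The merged pattern needs one call to findall instead of two, which is the saving the answer is pointing at.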

@Joshua-Lewis's answer led me to the following way to streamline the code above:

import re
vowelRegex = re.compile(r'[aeiouAEIOU]|[b-df-hj-np-tv-xzB-DF-HJ-NP-TV-XZ]y[b-df-hj-np-tv-xz]|y\b')
consoRegex = re.compile(r'[b-df-hj-np-tv-xzB-DF-HJ-NP-TV-XZ]|[aeiou]y[aeiou]|\bY')
vowelRescan = re.compile(r'[aeiouyAEIOUY]')
consoRescan = re.compile(r'[b-df-hj-np-tv-xyzB-DF-HJ-NP-TV-XYZ]')
thestring='any and every religion is violent irrational intolerant '\
    'allied to racism and tribalism bigotry invested in ignorance and '\
    'hostile to free inquiry contemptuous of women and coercive towards '\
    'children organized religion ought to have a great deal on its '\
    'conscience why it continues toward the 22nd century ACE is a mystery '\
    'known only to New Yorkers and lovers of the bayou'
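# First pass: vowels plus any y in a vowel context; the rescan keeps only the vowel letters
funn = ''.join(vowelRegex.findall(thestring))
fun = ''.join(vowelRescan.findall(funn))
# Same two steps for the consonants
nofunn = ''.join(consoRegex.findall(thestring))
nofun = ''.join(consoRescan.findall(nofunn))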

print('Vowels:',fun, len(fun))
print('Consos:',nofun, len(nofun))
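
One possible way to get down to a single pass, offered as a sketch rather than the definitive answer: replace the consumed neighbours of y with lookarounds, so every match is exactly one letter and no rescan is needed, and use two named groups to tell vowels from consonants. The names CONS, LETTER and classify are hypothetical, and re.IGNORECASE is an assumption of mine: it treats 'Y' and 'y' alike, which is slightly more permissive than the original patterns. A y that fits neither rule (for example the y in "lying") is left unclassified, as in the code above.

```python
import re

CONS = 'b-df-hj-np-tv-xz'  # consonant ranges, deliberately excluding y

# One pattern, two named groups. The lookarounds inspect the letters
# around y without consuming them, so each match is a single letter.
LETTER = re.compile(
    rf'(?P<vowel>[aeiou]|(?<=[{CONS}])y(?=[{CONS}])|y\b)'
    rf'|(?P<conso>[{CONS}]|(?<=[aeiou])y(?=[aeiou])|\by)',
    re.IGNORECASE,  # assumption: upper and lower case are treated alike
)

def classify(text):
    """Return (vowels, consonants) as two strings, using one pass over text."""
    found = {'vowel': [], 'conso': []}
    for m in LETTER.finditer(text):
        # lastgroup is the name of the group that matched: 'vowel' or 'conso'
        found[m.lastgroup].append(m.group())
    return ''.join(found['vowel']), ''.join(found['conso'])

# The 20-word test list from the question.
words = ('Sky Family Yurt Germany Crypt Day New York Pennsylvania Myth '
         'Hungry Yolk Year Bayou Yak Silly Beyond Dynamite Mystery Yacht Yoda')
vowels, consos = classify(words)

print('Vowels:', vowels, len(vowels))
print('Consos:', consos, len(consos))
print('y as vowel:', vowels.lower().count('y'))      # 12 for this list
print('y as consonant:', consos.lower().count('y'))  # 9 for this list
if consos:
    print('Vowel/consonant ratio:', len(vowels) / len(consos))
```

On the question's word list this reproduces the expected 12 vowel y's and 9 consonant y's, and the final line gives the ratio the title asks about.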

