
Regex to find words that start or end with a particular letter

Write a function called getWords(sentence, letter) that takes in a sentence and a single letter, and returns a list of the words that start or end with this letter, but not both, regardless of letter case.

For example:

>>> s = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
>>> getWords(s, "t")
['The', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'next']

My attempt:

regex = (r'[\w]*'+letter+r'[\w]*')
return (re.findall(regex,sentence,re.I))

My output:

['The', 'TART', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'until', 'next']

\b detects word boundaries. Verbose mode allows multi-line regexes and comments. Note that [^\W] is the same as \w, but to match \w except a certain letter, you need [^\W{letter}].

import re

def getWords(s,t):
    pattern = r'''(?ix)           # ignore case, verbose mode
                  \b{letter}      # start with letter
                  \w*             # zero or more additional word characters
                  [^{letter}\W]\b # ends with a word character that isn't letter
                  |               #    OR
                  \b[^{letter}\W] # does not start with a non-word character or letter
                  \w*             # zero or more additional word characters
                  {letter}\b      # ends with letter
                  '''.format(letter=t)
    return re.findall(pattern,s)

s = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
print(getWords(s,'t'))

Output:

['The', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'next']
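
The character-class trick used in the pattern above can be checked in isolation. The following is a minimal sketch, using a made-up sample string that is not part of the original question, showing that [^\W] behaves like \w and that adding a letter to the negated class removes that letter from the matches.

```python
import re

sample = "at_7 To!"  # hypothetical sample text, chosen only for illustration

# [^\W] is a double negative: "not a non-word character", which is \w.
same_as_w = re.findall(r"[^\W]", sample)
plain_w = re.findall(r"\w", sample)

# Adding t to the negated class also excludes t (and T, because of re.I).
no_t = re.findall(r"[^\Wt]", sample, re.I)

print(same_as_w == plain_w)  # True
print(no_t)                  # ['a', '_', '7', 'o']
```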

This is much easier with the startswith() and endswith() methods.

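def getWords(s, letter):
    letter = letter.lower()  # compare case-insensitively
    # keep words that start or end with the letter, but not both
    return [word for word in s.split()
            if (word.lower().startswith(letter) or word.lower().endswith(letter))
            and not (word.lower().startswith(letter) and word.lower().endswith(letter))]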

mystring = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
print(getWords(mystring, 't'))

Output:

['The', 'Tuesdays', 'Thursdays,', 'but', 'it', 'not', 'start', 'next']
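
The output above still contains 'Thursdays,' with its comma, because split() only splits on whitespace. One possible fix, sketched here on the assumption that stripping leading and trailing ASCII punctuation is acceptable, is to clean each word with str.strip(string.punctuation) before testing it. The name get_words_stripped is hypothetical and not part of the original answer.

```python
import string

def get_words_stripped(sentence, letter):
    """Sketch: start-or-end-but-not-both check with punctuation removed."""
    letter = letter.lower()
    result = []
    for raw in sentence.split():
        # Drop punctuation such as a trailing "," or "." around the word.
        word = raw.strip(string.punctuation)
        if not word:
            continue  # the token was punctuation only
        starts = word.lower().startswith(letter)
        ends = word.lower().endswith(letter)
        # For two booleans, != is an exclusive or: exactly one is True.
        if starts != ends:
            result.append(word)
    return result

s = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
print(get_words_stripped(s, "t"))
# ['The', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'next']
```

Comparing the two booleans with != also replaces the longer "(a or b) and not (a and b)" condition.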

Update (using regular expressions)

import re
result1 = re.findall(r'\b[t]\w+|\w+[t]\b', mystring, re.I)
result2 = re.findall(r'\b[t]\w+[t]\b', mystring, re.I)
print([x for x in result1 if x not in result2])

Explanation

The regular expression \b[t]\w+|\w+[t]\b finds words that start or end with the letter t, and \b[t]\w+[t]\b finds words that both start and end with the letter t.

After generating the two lists, keep only the words from the first list that do not appear in the second.
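
The two-pass idea can also be wrapped in a function that takes the letter as a parameter. This is only a sketch: the name get_words_two_pass is hypothetical, and re.escape is added on the assumption that the letter could be a regex metacharacter.

```python
import re

def get_words_two_pass(sentence, letter):
    """Sketch: words that start or end with letter, minus words that do both."""
    ch = re.escape(letter)  # assumption: guard against regex metacharacters
    # Pass 1: words that start with the letter or end with it.
    either = re.findall(rf"\b{ch}\w+|\w+{ch}\b", sentence, re.I)
    # Pass 2: words that both start and end with the letter.
    both = set(re.findall(rf"\b{ch}\w+{ch}\b", sentence, re.I))
    # Keep the order of the first pass and drop anything found in the second.
    return [word for word in either if word not in both]

s = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
print(get_words_two_pass(s, "t"))
# ['The', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'next']
```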

Why are you using regex for this? Just check the first and last character.

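def getWords(s, letter):
    letter = letter.lower()  # compare case-insensitively
    words = s.split()
    # keep a word when its first and last characters differ and one of them is the letter
    return [word for word in words
            if word[0].lower() != word[-1].lower()
            and letter in (word[0].lower(), word[-1].lower())]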

If you want a regex for this, then use:

regex = r'\b(#\w*[^#\W]|[^#\W]\w*#)\b'.replace('#', letter)

The replace is done to avoid repeating the verbose +letter+ concatenation.

The code then looks like this:

import re

def getWords(sentence, letter):
    regex = r'\b(#\w*[^#\W]|[^#\W]\w*#)\b'.replace('#', letter)
    return re.findall(regex, sentence, re.I)

s = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
result = getWords(s, "t")
print(result)

Output:

['The', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'next']

Explanation

I have used # as a placeholder for the actual letter; it is replaced in the regular expression before the expression is used.

  • \b : word boundary
  • \w* : 0 or more word characters (letters, digits, or underscores)
  • [^#\W] : a word character that is not # (the given letter)
  • | : logical OR. The left side matches words that start with the letter but don't end with it, and the right side matches the opposite case.
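
To see how the two sides of the | divide the work, each alternative can be run on its own. The following is a small sketch, assuming the letter t and the sample sentence from the question; the names left and right are illustrative only.

```python
import re

s = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."

# Left alternative: starts with t, ends with a word character other than t.
left = r"\bt\w*[^t\W]\b"
# Right alternative: starts with a word character other than t, ends with t.
right = r"\b[^t\W]\w*t\b"

starts_only = re.findall(left, s, re.I)
ends_only = re.findall(right, s, re.I)

print(starts_only)  # ['The', 'Tuesdays', 'Thursdays']
print(ends_only)    # ['but', 'it', 'not', 'start', 'next']
```

Neither alternative can match 'TART', since with re.I the class [^t\W] rejects both t and T.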

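You can try the built-in startswith and endswith string methods. Note that this version does not exclude words that both start and end with the letter (such as 'TART'), and it keeps trailing punctuation.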

>>> string = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
>>> [i for i in string.split() if i.lower().startswith('t') or i.lower().endswith('t')]
['The', 'TART', 'Tuesdays', 'Thursdays,', 'but', 'it', 'not', 'start', 'next']

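Notice: The technical posts on this site are licensed under CC BY-SA 4.0. If you republish them, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.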

 