Regex to find words that start or end with a particular letter

Write a function called getWords(sentence, letter) that takes in a sentence and a single letter, and returns a list of the words that start or end with this letter, but not both, regardless of the letter case.

For example:

>>> s = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
>>> getWords(s, "t")
['The', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'next']

My attempt:

regex = (r'[\w]*'+letter+r'[\w]*')
return (re.findall(regex,sentence,re.I))

My Output:

['The', 'TART', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'until', 'next']

\b matches a word boundary. Verbose mode allows multi-line regexes with comments. Note that [^\W] is the same as \w, but to match \w except a certain letter, you need [^\W{letter}].

import re

def getWords(s,t):
    pattern = r'''(?ix)           # ignore case, verbose mode
                  \b{letter}      # start with letter
                  \w*             # zero or more additional word characters
                  [^{letter}\W]\b # ends with a word character that isn't letter
                  |               #    OR
                  \b[^{letter}\W] # starts with a word character that isn't letter
                  \w*             # zero or more additional word characters
                  {letter}\b      # ends with letter
                  '''.format(letter=t)
    return re.findall(pattern,s)

s = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
print(getWords(s,'t'))

Output:

['The', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'next']

This is much easier with the startswith() and endswith() methods.

def getWords(s, letter):
    letter = letter.lower()
    # keep words that start or end with the letter, but not both
    # (comparing the two booleans with != acts as an exclusive or)
    return [word for word in s.split()
            if word.lower().startswith(letter) != word.lower().endswith(letter)]

mystring = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
print(getWords(mystring, 't'))

Output

['The', 'Tuesdays', 'Thursdays,', 'but', 'it', 'not', 'start', 'next']
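
Note that 'Thursdays,' keeps its trailing comma, because split() only separates on whitespace. A minimal sketch of one way around that, assuming it is acceptable to strip leading and trailing punctuation from each token before testing it. The name get_words_stripped is hypothetical and not part of the original answer:

```python
import string

def get_words_stripped(sentence, letter):
    """Words that start or end with `letter` (case-insensitive), but not both."""
    letter = letter.lower()
    result = []
    for raw in sentence.split():
        # Assumption: punctuation attached to a token is not part of the word.
        word = raw.strip(string.punctuation)
        if not word:
            continue  # the token consisted of punctuation only
        lowered = word.lower()
        # != on two booleans acts as XOR: exactly one end matches
        if lowered.startswith(letter) != lowered.endswith(letter):
            result.append(word)
    return result

s = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
print(get_words_stripped(s, "t"))
# ['The', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'next']
```

This only handles punctuation at the edges of a token; words joined by punctuation without spaces would still be treated as one word.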

Update (using regular expression)

import re
result1 = re.findall(r'\b[t]\w+|\w+[t]\b', mystring, re.I)
result2 = re.findall(r'\b[t]\w+[t]\b', mystring, re.I)
print([x for x in result1 if x not in result2])

Explanation

The regular expressions \b[t]\w+ and \w+[t]\b find words that start or end with the letter t, and \b[t]\w+[t]\b finds words that both start and end with the letter t.

After generating the two lists of words, just take their difference: keep every word from the first list that does not appear in the second.
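
The same two-pass idea can be wrapped in a function that takes the letter as a parameter. This is only a sketch under a few assumptions: the name get_words_two_pass is hypothetical, re.escape is added so that a regex metacharacter passed as the letter is treated literally, and the second pattern uses \w* instead of \w+ so that two-letter words such as "tt" are also recognised as starting and ending with the letter (a small departure from the patterns above).

```python
import re

def get_words_two_pass(sentence, letter):
    # Assumption: the letter should be matched literally, hence re.escape.
    ch = re.escape(letter)
    # Pass 1: words that start with the letter, or end with it.
    either = re.findall(r'\b{0}\w+|\w+{0}\b'.format(ch), sentence, re.I)
    # Pass 2: words that both start and end with the letter.
    both = re.findall(r'\b{0}\w*{0}\b'.format(ch), sentence, re.I)
    # List difference: keep the order of the first list.
    return [word for word in either if word not in both]

mystring = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
print(get_words_two_pass(mystring, "t"))
# ['The', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'next']
```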

Why are you using regex for this? Just check the first and last character.

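def getWords(s, letter):
    letter = letter.lower()
    # compare only the first and last character of each word;
    # split() never yields empty strings, so w[0] and w[-1] are safe
    return [w for w in s.split()
            if (w[0].lower() == letter) != (w[-1].lower() == letter)]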

If you do want a regex for this, then use:

regex = r'\b(#\w*[^#\W]|[^#\W]\w*#)\b'.replace('#', letter)

The replace() call avoids repeating the verbose +letter+ concatenation.

The complete code then looks like this:

import re

def getWords(sentence, letter):
    regex = r'\b(#\w*[^#\W]|[^#\W]\w*#)\b'.replace('#', letter)
    return re.findall(regex, sentence, re.I)

s = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
result = getWords(s, "t")
print(result)

Output:

['The', 'Tuesdays', 'Thursdays', 'but', 'it', 'not', 'start', 'next']

Explanation

I have used # as a placeholder for the actual letter, and that will get replaced in the regular expression before it is actually used.

  • \b : word boundary
  • \w* : 0 or more word characters (letters, digits, or underscores)
  • [^#\W] : a word character that is not # (the given letter)
  • | : logical OR. The left side matches words that start with the letter, but don't end with it, and the right side matches the opposite case.
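
To see which side of the | each word matches, the two alternatives can be given names. The following is a sketch rather than part of the original answer: classify_words and the group names starts and ends are hypothetical, and re.escape is an added assumption so that the letter is always treated literally.

```python
import re

def classify_words(sentence, letter):
    ch = re.escape(letter)  # assumption: match the letter literally
    pattern = re.compile(
        # left branch: starts with the letter, ends with another word character
        # right branch: starts with another word character, ends with the letter
        r'\b(?:(?P<starts>{0}\w*[^{0}\W])|(?P<ends>[^{0}\W]\w*{0}))\b'.format(ch),
        re.I,
    )
    # m.lastgroup is the name of the branch that matched
    return [(m.group(0), m.lastgroup) for m in pattern.finditer(sentence)]

s = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
for word, branch in classify_words(s, "t"):
    print(word, branch)
```

With the sample sentence, 'The', 'Tuesdays' and 'Thursdays' come from the left branch and 'but', 'it', 'not', 'start' and 'next' from the right one, while 'TART' matches neither.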

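You can try the built-in startswith and endswith string methods. Note that this version does not apply the "but not both" rule, so 'TART' stays in the result.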

>>> string = "The TART program runs on Tuesdays and Thursdays, but it does not start until next week."
>>> [i for i in string.split() if i.lower().startswith('t') or i.lower().endswith('t')]
['The', 'TART', 'Tuesdays', 'Thursdays,', 'but', 'it', 'not', 'start', 'next']

