
Using Regex to find words with characters that are the same or that are different

I have a list of words such as:

l = """abca
bcab
aaba
cccc
cbac
babb
"""

I want to find the words whose first and last characters are the same, and whose two middle characters both differ from that first/last character.

The desired final result:

['abca', 'bcab', 'cbac']

I tried this:

re.findall('^(.)..\\1$', l, re.MULTILINE)

But it returns all of the unwanted words as well. I thought of using [^...] somehow, but I couldn't figure it out. There's a way of doing this with sets (to filter the results from the search above), but I'm looking for a regex.

Is it possible?

There are lots of ways to do this. Here's probably the simplest:

re.findall(r'''
           \b          #The beginning of a word (a word boundary)
           ([a-z])     #One letter
           (?!\w*\1\B) #The rest of this word may not contain the starting letter except at the end of the word
           [a-z]*      #Any number of other letters
           \1          #The starting letter we captured in step 2
           \b          #The end of the word (another word boundary)
           ''', l, re.IGNORECASE | re.VERBOSE)

If you want, you can loosen the requirements a bit by replacing [a-z] with \w . That will allow numbers and underscores as well as letters. You can also restrict it to 4-character words by changing the last * in the pattern to {2} .

Note also that I'm not very familiar with Python, so I'm assuming your usage of findall is correct.
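On the findall point, a minimal sketch under the assumption that the goal is the full words: when a pattern contains a capturing group, re.findall returns the text of that group rather than the whole match, so the call above yields only the first letters. Reading group(0) from re.finditer gives the words themselves. The names PATTERN, first_letters and matched_words are illustrative, not from the original answer.

```python
import re

l = """abca
bcab
aaba
cccc
cbac
babb
"""

PATTERN = re.compile(r'''
    \b            # start of a word
    ([a-z])       # capture the first letter
    (?!\w*\1\B)   # that letter may not reappear before the end of the word
    [a-z]*        # the middle letters
    \1            # the first letter again
    \b            # end of the word
    ''', re.IGNORECASE | re.VERBOSE)

# With one capturing group, findall returns only that group's text.
first_letters = PATTERN.findall(l)

# group(0) is the whole match, i.e. the word itself.
matched_words = [m.group(0) for m in PATTERN.finditer(l)]
print(first_letters)
print(matched_words)
```

Wrapping the whole pattern in an extra outer group would also work, but findall would then return tuples and the backreference number would change.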

Edit: fixed to use negative lookahead assertions instead of negative lookbehind assertions. See the comments from @AlanMoore and @bukzor for explanations.

>>> [s for s in l.splitlines() if re.search(r'^(.)(?!\1).(?!\1).\1$', s)]
['abca', 'bcab', 'cbac']

The solution uses negative lookahead assertions, which mean 'match at the current position only if it isn't followed by a match for something else.' Now, take a look at the lookahead assertion - (?!\1) . All this means is 'let the next character match only if it is not the same as the first character (the one captured by group 1).'
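To make the "look but don't consume" behaviour concrete, here is a small sketch. The toy pattern a(?!b) and the variable names are made up for illustration; the second half reuses the pattern from this answer on three of the sample words.

```python
import re

# A negative lookahead is zero-width: it inspects what follows
# without consuming it.
m = re.match(r'a(?!b)', 'ac')
consumed = m.group(0)                   # only 'a' is part of the match
rejected = re.match(r'a(?!b)', 'ab')    # 'a' is followed by 'b', so no match

# The same mechanism with a backreference: each middle character
# is checked against group 1 before it is consumed.
PATTERN = re.compile(r'^(.)(?!\1).(?!\1).\1$')
results = {w: bool(PATTERN.search(w)) for w in ['abca', 'aaba', 'babb']}
print(consumed, rejected, results)
```

'aaba' fails at the first lookahead (the second character repeats the first) and 'babb' fails at the second one.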

To heck with regexes.

[
    word
    for word in words.split()  # split() drops the empty string left by a trailing newline
    if word[0] == word[-1]
    and word[0] not in word[1:-1]
]

Are you required to use regexes? This is a much more pythonic way to do the same thing:

l = """abca
bcab
aaba
cccc
cbac
babb
"""

for word in l.split():
    if word[-1] == word[0] and word[0] not in word[1:-1]:
        print(word)

Here's how I would do it:

result = re.findall(r"\b([a-z])(?:(?!\1)[a-z]){2}\1\b", subject)

This is similar to Justin's answer, except where that one does a one-time lookahead, this one checks each letter as it's consumed.

\b
([a-z])  # Capture the first letter.
(?:
  (?!\1)   # Unless it's the same as the first letter...
  [a-z]    # ...consume another letter.
){2}
\1
\b

I don't know what your real data looks like, so I chose [a-z] arbitrarily because it works with your sample data. I limited the length to four characters for the same reason. As with Justin's answer, you may want to change the {2} to * , + or some other quantifier.
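As a rough illustration of the quantifier change, the sketch below swaps {2} for + so that words of any length with at least one middle letter are accepted. The sample string is invented for the example, and finditer with group(0) is used so that whole words come back rather than just the captured first letter.

```python
import re

# Same per-letter check, but + instead of {2}: one or more middle letters.
PATTERN = re.compile(r"\b([a-z])(?:(?!\1)[a-z])+\1\b")

# Hypothetical sample: two matching words, one with the first letter
# repeated in the middle, and one with no middle letters at all.
sample = "abca abcdea abaca aa"

found = [m.group(0) for m in PATTERN.finditer(sample)]
print(found)
```

With * instead of +, the two-letter word 'aa' would match as well, since zero middle letters would then be allowed.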

You can do this with negative lookahead or lookbehind assertions; see http://docs.python.org/library/re.html for details.

Not a Python guru, but maybe this

re.findall(r'^(.)(?:(?!\1).)*\1$', l, re.MULTILINE)

expanded (use multi-line modifier):

^                # begin of line
  (.)            # capture grp 1, any char except newline
  (?:            # grouping
     (?!\1)         # Lookahead assertion, not what was in capture group 1 (backref to 1)
     .              # this is ok, grab any char except newline
  )*             # end grouping, do 0 or more times (could force length with {2} instead of *)
  \1             # backref to group 1, this character must be the same
$                # end of line
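The expanded form can be run more or less as written by adding re.VERBOSE, which makes Python ignore the whitespace and the # comments inside the pattern. This is only a sketch: the raw string keeps \1 as a backreference, and the names PATTERN and line_matches are illustrative.

```python
import re

l = "abca\nbcab\naaba\ncccc\ncbac\nbabb\n"

PATTERN = re.compile(r'''
    ^            # beginning of line
    (.)          # group 1: any character except newline
    (?:
      (?!\1)     # the next character must differ from group 1
      .          # consume it
    )*           # zero or more times (use {2} to force four-character words)
    \1           # the same character as group 1
    $            # end of line
    ''', re.VERBOSE | re.MULTILINE)

# group(0) returns the whole line; findall would return only group 1.
line_matches = [m.group(0) for m in PATTERN.finditer(l)]
print(line_matches)
```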


 