Using Regex to find words with characters that are the same or that are different
I have a list of words such as:
l = """abca
bcab
aaba
cccc
cbac
babb
"""
I want to find the words whose first and last characters are the same, and whose two middle characters both differ from that first/last character.
The desired final result:
['abca', 'bcab', 'cbac']
I tried this:
re.findall('^(.)..\\1$', l, re.MULTILINE)
But it returns all of the unwanted words as well. I thought of using [^...] somehow, but I couldn't figure it out. There's a way of doing this with sets (to filter the results from the search above), but I'm looking for a regex.
Is it possible?
There are lots of ways to do this. Here's probably the simplest:
re.findall(r'''
\b #The beginning of a word (a word boundary)
([a-z]) #One letter
(?!\w*\1\B) #The rest of this word may not contain the starting letter except at the end of the word
[a-z]* #Any number of other letters
\1 #The starting letter we captured in step 2
\b #The end of the word (another word boundary)
''', l, re.IGNORECASE | re.VERBOSE)
If you want, you can loosen the requirements a bit by replacing [a-z] with \w. That will allow numbers and underscores as well as letters. You can also restrict it to 4-character words by changing the last * in the pattern to {2}.
Note also that I'm not very familiar with Python, so I'm assuming your usage of findall is correct.
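A note on findall for Python readers: when a pattern contains a capturing group, re.findall returns the contents of that group rather than the whole match, so the call above would give back only the first letters. The following is a minimal sketch, assuming the whole words are what is wanted; it uses re.finditer and group(0), and the variable names are illustrative only.

```python
import re

l = """abca
bcab
aaba
cccc
cbac
babb
"""

# Same pattern as above, written on one line
pattern = re.compile(r'\b([a-z])(?!\w*\1\B)[a-z]*\1\b', re.IGNORECASE)

# findall returns only what group 1 captured (the first letter of each match)
first_letters = pattern.findall(l)

# finditer yields match objects; group(0) is the entire matched word
words = [m.group(0) for m in pattern.finditer(l)]

print(first_letters)
print(words)
```

Wrapping the whole pattern in an extra group would also work, but findall would then return tuples, so finditer tends to be the simpler choice.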
Edit: fixed to use negative lookahead assertions instead of negative lookbehind assertions. See the comments for explanations from @AlanMoore and @bukzor.
>>> [s for s in l.splitlines() if re.search(r'^(.)(?!\1).(?!\1).\1$', s)]
['abca', 'bcab', 'cbac']
The solution uses negative lookahead assertions, which mean 'match at the current position only if it isn't followed by a match for something else.' Now, take a look at the lookahead assertion (?!\1). All this means is 'match the next character only if it isn't the same as the first character.'
To heck with regexes.
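To see that a lookahead only inspects the text and does not consume it, here is a small sketch on two made-up two-letter strings (the strings and names are assumptions for illustration, not part of the question's data).

```python
import re

# (?!\1) peeks at the next character without consuming it,
# so the '.' that follows still has to match that same character.
pat = re.compile(r'(.)(?!\1).')

ok = pat.match('ab')        # 'b' differs from 'a', so the match succeeds
rejected = pat.match('aa')  # the next character repeats 'a', so there is no match

print(ok.group(0))
print(rejected)
```

The matched text for 'ab' includes the 'b', which shows that the character was consumed by the '.' and not by the assertion.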
[
    word
    for word in words.split()
    if word[0] == word[-1]
    and word[0] not in word[1:-1]
]
Are you required to use regexes? This is a much more Pythonic way to do the same thing:
l = """abca
bcab
aaba
cccc
cbac
babb
"""
for word in l.split():
    if word[-1] == word[0] and word[0] not in word[1:-1]:
        print(word)
Here's how I would do it:
result = re.findall(r"\b([a-z])(?:(?!\1)[a-z]){2}\1\b", subject)
This is similar to Justin's answer, except where that one does a one-time lookahead, this one checks each letter as it's consumed.
\b
([a-z]) # Capture the first letter.
(?:
(?!\1) # Unless it's the same as the first letter...
[a-z] # ...consume another letter.
){2}
\1
\b
I don't know what your real data looks like, so I chose [a-z] arbitrarily because it works with your sample data. I limited the length to four characters for the same reason. As with Justin's answer, you may want to change the {2} to *, + or some other quantifier.
You can do this with negative lookahead or lookbehind assertions; see http://docs.python.org/library/re.html for details.
Not a Python guru, but maybe this
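As a rough sketch of the quantifier change, here is the same pattern with + in place of {2}, run against a made-up string of words of varying length (the sample string is hypothetical, and finditer is used so that whole words come back instead of the captured first letter).

```python
import re

# Hypothetical sample with words of different lengths
subject = "abca xyzzx abab aa a noon level"

# '+' requires at least one inner letter, each checked against group 1
pattern = re.compile(r"\b([a-z])(?:(?!\1)[a-z])+\1\b")

matches = [m.group(0) for m in pattern.finditer(subject)]
print(matches)
```

With * instead of +, a two-letter word such as aa would match as well, since the inner group could then repeat zero times.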
re.findall(r'^(.)(?:(?!\1).)*\1$', l, re.MULTILINE)
expanded (use multi-line modifier):
^ # begin of line
(.) # capture grp 1, any char except newline
(?: # grouping
(?!\1) # Lookahead assertion, not what was in capture group 1 (backref to 1)
. # this is ok, grab any char except newline
)* # end grouping, do 0 or more times (could force length with {2} instead of *)
\1 # backref to group 1, this character must be the same
$ # end of line
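The expanded form can be handed to Python more or less as written by combining re.VERBOSE with re.MULTILINE. This is only a sketch under the assumption that the input is the newline-separated list from the question; finditer is used because findall would return just the captured first character.

```python
import re

l = """abca
bcab
aaba
cccc
cbac
babb
"""

pattern = re.compile(r'''
    ^            # beginning of line
    (.)          # capture group 1: any character except newline
    (?:          # non-capturing group
        (?!\1)   # lookahead: the next character must differ from group 1
        .        # consume that character
    )*           # repeat zero or more times
    \1           # the same character as group 1
    $            # end of line
''', re.MULTILINE | re.VERBOSE)

result = [m.group(0) for m in pattern.finditer(l)]
print(result)
```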