Python Regex doesn't match . (dot) as a character

I have a regex that matches all three-character words in a string:

\b[^\s]{3}\b

When I use it with the string:

And the tiger attacked you.

this is the result:

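import re
string = u"And the tiger attacked you."
regex = re.compile(r"\b[^\s]{3}\b")  # raw string, so \b is a word boundary and not a backspace
regex.findall(string)
[u'And', u'the', u'you']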

As you can see, it matches "you" as a three-character word, but I want the expression to treat "you." (including the ".") as a four-character word.

I have the same problem with ",", ";", ":", etc.

I'm pretty new to regex, but I guess this happens because those characters are treated as word boundaries.

Is there a way of doing this?

Thanks in advance,

EDIT

Thanks to the answers of @BrenBarn and @Kendall Frey, I managed to get to the regex I was looking for:

(?<!\w)[^\s]{3}(?=$|\s)
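To see what this final pattern does, here is a minimal sketch that runs it on the sentence from the question and on a second sentence. The second sentence and the variable names are made up for illustration.

```python
import re

# Pattern from the EDIT above: not preceded by a word character,
# exactly three non-whitespace characters, then whitespace or end of string.
pattern = re.compile(r"(?<!\w)[^\s]{3}(?=$|\s)")

original = "And the tiger attacked you."  # sentence from the question
extra = "Go on, try it."                  # hypothetical extra sentence

print(pattern.findall(original))  # ['And', 'the']
print(pattern.findall(extra))     # ['on,', 'try', 'it.']
```

"you." is no longer reported because, with the period counted, it is four characters long, while "on," and "it." are reported as three-character words.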

If you want to make sure the word is preceded and followed by whitespace (and not by a period, as happens in your case), use lookaround:

(?<=\s)\w{3}(?=\s)

If you need it to match punctuation as part of words (such as 'in.'), then \w won't be adequate, and you can use \S (any non-whitespace character) instead:

(?<=\s)\S{3}(?=\s)
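A short sketch comparing the two patterns above on a made-up sentence. Note that both require whitespace on each side, so a word at the very start or end of the string is skipped. The third pattern, `(?<!\S)\S{3}(?!\S)`, is not from this answer; it is a common variant, shown here as an assumption about what you might want, that also accepts the string edges.

```python
import re

text = "And the tiger bit me. Run, go."  # hypothetical sample sentence

# Word characters only, whitespace required on both sides.
word_only = re.findall(r"(?<=\s)\w{3}(?=\s)", text)

# Any non-whitespace characters, whitespace required on both sides.
non_space = re.findall(r"(?<=\s)\S{3}(?=\s)", text)

# Variant (not from the answer): "no non-space neighbour" instead of
# "a space neighbour", so start and end of string also qualify.
edge_aware = re.findall(r"(?<!\S)\S{3}(?!\S)", text)

print(word_only)   # ['the', 'bit']
print(non_space)   # ['the', 'bit', 'me.']
print(edge_aware)  # ['And', 'the', 'bit', 'me.', 'go.']
```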

As described in the documentation:

A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
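As a quick illustration of that definition, the sketch below lists the positions where `\b` matches in a short made-up string. There is a boundary between "u" and ".", which is why "you" is found on its own.

```python
import re

text = "you. ok"  # hypothetical sample string

# \b is zero-width, so each match is empty; we only look at its position.
boundaries = [m.start() for m in re.finditer(r"\b", text)]
print(boundaries)  # [0, 3, 5, 7]

# Index 3 sits between "u" and ".", so the original pattern stops there.
three_chars = re.findall(r"\b[^\s]{3}\b", "you.")
print(three_chars)  # ['you']
```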

So if you want a period to count as a word character and not a word boundary, you can't use \b to indicate a word boundary. You'll have to use your own character class. For instance, you can use a regex like \s[^\s]{3}\s if you want to match 3 non-space characters surrounded by spaces. If you still want the boundary to be zero-width (i.e., to restrict the match without being included in it), you could use lookaround, something like (?<=\s)[^\s]{3}(?=\s).
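The difference between the consuming and the zero-width version can be sketched as follows, on a made-up string. Because `\s[^\s]{3}\s` consumes the surrounding spaces, two adjacent three-character words cannot both match: the space between them is already used up by the first match.

```python
import re

text = "a cat sat the end"  # hypothetical sample string

# Consuming version: the spaces are part of each match.
consuming = re.findall(r"\s[^\s]{3}\s", text)

# Zero-width version: the spaces are checked but not consumed.
zero_width = re.findall(r"(?<=\s)[^\s]{3}(?=\s)", text)

print(consuming)   # [' cat ', ' the ']
print(zero_width)  # ['cat', 'sat', 'the']
```

In both cases "end" is missed, since nothing follows it; the lookahead would need an alternative such as end of string to cover that.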

This would be my approach. It also matches words that come right after punctuation.

import re

r = r'''
        \b                   # word boundary
        (                    # capturing parentheses
            [^\s]{3}         # anything but whitespace 3 times
            \b               # word boundary
            (?=[^\.,;:]|$)   # do not allow . or , or ; or : after word boundary, but allow end of string
        |                    # OR
            [^\s]{2}         # anything but whitespace 2 times
            [\.,;:]          # a . or , or ; or :
        )
    '''
s = 'And the tiger attacked you. on,bla tw; th: fo.tes'

print(re.findall(r, s, re.X))  # re.X (verbose) lets the pattern carry whitespace and comments

output:

['And', 'the', 'on,', 'bla', 'tw;', 'th:', 'fo.', 'tes']
