
Regex - Match words in pattern, except within email address

I'm looking to find words in a string that match a specific pattern. Problem is, if the words are part of an email address, they should be ignored.

To simplify, the pattern of the "proper words" is \w+\.\w+ : one or more word characters, a literal period, and another run of word characters.

The sentence that causes the problem is, for example, a.a b.b:c.c d.d@e.e.e .

The goal is to match only [a.a, b.b, c.c] . With most regexes I build, e.e is returned as well (because I use some word boundary match).

For example:

>>> re.findall(r"(?:^|\s|\W)(?<!@)(\w+\.\w+)(?!@)\b", "a.a b.b:c.c d.d@e.e.e")
['a.a', 'b.b', 'c.c', 'e.e']

How can I match only among words that do not contain "@"?

I would definitely clean it up first and simplify the regex.

First, split the string into words:

import re
words = re.split(r':|\s', "a.a b.b:c.c d.d@e.e.e")

Then filter out the words that have an @ in them:

words = [word for word in words if re.search(r'^((?!@).)*$', word)]
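The split-then-filter idea can be sketched as a single function. The helper name `proper_words` is hypothetical, and the final `re.fullmatch` check against the question's `\w+\.\w+` pattern is an added assumption, since the steps above only remove the email-like tokens:

```python
import re

def proper_words(text):
    """Return tokens of the form word.word, skipping any token that contains '@'."""
    # Split on runs of colons and whitespace.
    tokens = re.split(r"[:\s]+", text)
    # Drop email-like tokens (anything containing '@'), then keep only
    # tokens that match the word pattern in full.
    return [t for t in tokens
            if "@" not in t and re.fullmatch(r"\w+\.\w+", t)]

print(proper_words("a.a b.b:c.c d.d@e.e.e"))
# => ['a.a', 'b.b', 'c.c']
```

A plain `"@" not in t` test does the same job here as the `^((?!@).)*$` regex and is usually easier to read.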

Properly parsing email addresses with a regex is extremely hard, but for your simplified case, where a word is \w+\.\w+ and an email is any sequence that contains @ , you might find that this regex does what you need:

>>> re.findall(r"(?:^|[:\s]+)(\w+\.\w+)(?=[:\s]+|$)", "a.a b.b:c.c d.d@e.e.e")
['a.a', 'b.b', 'c.c']

The trick here is not to focus on what comes in the next or previous word, but on what the word currently captured has to look like.

Another trick is in properly defining word separators. Before the word we allow whitespace, : , or the start of the string, consuming those characters but not capturing them. After the word we require almost the same (the end of the string instead of the start), but we do not consume those characters; we use a lookahead assertion instead.
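To see why the trailing separator is only asserted and not consumed, compare the regex above with a variant that consumes it. The `consuming` pattern is a hypothetical counter-example written for this comparison, not something proposed in the answer:

```python
import re

s = "a.a b.b:c.c d.d@e.e.e"

# Trailing separator as a lookahead: it is checked but left in place,
# so it can serve as the leading separator of the next word.
lookahead = r"(?:^|[:\s]+)(\w+\.\w+)(?=[:\s]+|$)"

# Hypothetical variant: the trailing separator is consumed, so the
# next word no longer has a separator in front of it.
consuming = r"(?:^|[:\s]+)(\w+\.\w+)(?:[:\s]+|$)"

print(re.findall(lookahead, s))  # => ['a.a', 'b.b', 'c.c']
print(re.findall(consuming, s))  # => ['a.a', 'c.c']
```

In the consuming variant the space after a.a is used up by the first match, so b.b has nothing left to satisfy the leading `(?:^|[:\s]+)` and is skipped.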

You may match the email-like substrings with \S+@\S+\.\S+ and match and capture your pattern with (\w+\.\w+) in all other contexts. Use re.findall to return only the captured values, and filter out the empty items (they appear in the re.findall results whenever the email alternative matches):

import re
rx = r"\S+@\S+\.\S+|(\w+\.\w+)"
s = "a.a b.b:c.c d.d@e.e.e"
res = list(filter(None, re.findall(rx, s)))  # list() is needed in Python 3, where filter() returns an iterator
print(res)
# => ['a.a', 'b.b', 'c.c']
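The same match-and-discard idea can also be written with `re.finditer`, which avoids the separate filtering step. This is a minimal sketch; the function name `words_outside_emails` is hypothetical:

```python
import re

# The first alternative swallows email-like tokens without capturing;
# the second alternative captures a word in group 1.
RX = re.compile(r"\S+@\S+\.\S+|(\w+\.\w+)")

def words_outside_emails(text):
    # m.group(1) is None whenever the email alternative matched.
    return [m.group(1) for m in RX.finditer(text) if m.group(1)]

print(words_outside_emails("a.a b.b:c.c d.d@e.e.e"))
# => ['a.a', 'b.b', 'c.c']
```

One caveat of this pattern: `\S+` treats everything between whitespace as part of the email, so a word joined to an address by a colon (for example x.x:d.d@e.e ) is discarded together with it.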

