
How to extract an unknown number of different parts from a string with a Python regex?

Does anyone know a smart way to extract an unknown number of different parts from a string with a Python regex?

I know this question is probably too general to answer clearly, so let's look at an example:

S = "name.surname@sub1.sub2.sub3"

As a result, I would like to get the local part and each subdomain separately. Note that this sample email address has three subdomains, but I would like a regular expression that can capture any number of them, so please do not rely on that number. To avoid straying from the point, let's additionally assume that only alphanumeric characters (hence \w), dots and one @ are allowed in email addresses.

I tried to solve it myself and found this way:

import re

L = re.findall(r"([\w.]+)(?=@)|(\w+)", S)
for local, sub in L:
    # Each match fills exactly one group; the other one is an empty string.
    print(local or sub, end=" ")
# output: name.surname sub1 sub2 sub3

But it doesn't look nice to me. Does anyone know a way to achieve this with one regex and without any loop?

Of course, we can easily do it without regular expressions:

L = S.split('@')
localPart = L[0]             # name.surname
subdomains = L[1].split('.') # ['sub1', 'sub2', 'sub3']

But I am interested in how to figure it out with regexes.

[EDIT]

Uff, I finally figured this out. Here is a nice solution:

S = "name.surname@sub1.sub2.sub3"
print(re.split(r"@|\.(?!.*@)", S)) # ['name.surname', 'sub1', 'sub2', 'sub3']
S = "name.surname.nick@sub1.sub2.sub3.sub4"
print(re.split(r"@|\.(?!.*@)", S)) # ['name.surname.nick', 'sub1', 'sub2', 'sub3', 'sub4']

Perfect output.
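
To make the lookahead easier to follow, here is a small Python 3 sketch of the same idea. The helper name split_address and the constant ADDRESS_SPLIT are hypothetical, introduced only for illustration. The pattern splits at "@", and at a dot only when no "@" appears later in the string, which is why the dots in the local part survive. It relies on the simplifying assumptions from the question (exactly one "@", only word characters and dots).

```python
import re

# Split at "@", or at a "." that has no "@" anywhere after it.
# Assumes exactly one "@" and only word characters and dots, as in the question.
ADDRESS_SPLIT = re.compile(r"@|\.(?!.*@)")

def split_address(address):
    """Return [local_part, subdomain1, subdomain2, ...] (hypothetical helper)."""
    return ADDRESS_SPLIT.split(address)

parts = split_address("name.surname@sub1.sub2.sub3")
local_part, subdomains = parts[0], parts[1:]
print(local_part)   # name.surname
print(subdomains)   # ['sub1', 'sub2', 'sub3']
```

Slicing the result into parts[0] and parts[1:] gives the local part and the subdomain list without any explicit loop.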

If I understand your request correctly, you want to find each section in your sample email address, without the periods. A plain \w+ pattern with re.findall does that; compiling the pattern first with re.compile is optional. For example:

import re
s = "name.surname@sub1.sub2.sub3"
r = r"\w+"           # raw string, so the backslash reaches the regex engine as-is
r2 = re.compile(r)
print(re.findall(r2, s))

This finds every match of the compiled pattern r2 in the string s and returns ['name', 'surname', 'sub1', 'sub2', 'sub3']. Note that it also splits the local part at its dot.
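
As a quick check, the sketch below compares calling re.findall with a plain pattern string against calling findall on a precompiled pattern. Both forms return the same list. The variable names from_string and from_compiled are only illustrative.

```python
import re

s = "name.surname@sub1.sub2.sub3"

# Module-level function with a pattern string.
from_string = re.findall(r"\w+", s)

# Method on a precompiled pattern object.
from_compiled = re.compile(r"\w+").findall(s)

print(from_string == from_compiled)  # True
print(from_string)  # ['name', 'surname', 'sub1', 'sub2', 'sub3']
```

Precompiling mainly pays off when the same pattern is reused many times; for a single call the two forms behave the same.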

Basically, you can use the fact that when the pattern contains a capture group, re.findall returns only the content of that group rather than the whole match:

>>> re.findall(r'(?:^[^@]*@|\.)([^.]*)', s)
['sub1', 'sub2', 'sub3']
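
To illustrate how capture groups change what re.findall returns, here is a short Python 3 sketch. The first two patterns are the one above with and without its group. The third, with two groups, is an assumed extension that is not part of the original answer; it also captures the local part, at the cost of getting tuples back.

```python
import re

s = "name.surname@sub1.sub2.sub3"

# No capture group: each item is the whole match.
whole = re.findall(r"(?:^[^@]*@|\.)[^.]*", s)
print(whole)      # ['name.surname@sub1', '.sub2', '.sub3']

# One capture group: each item is just that group's content.
one_group = re.findall(r"(?:^[^@]*@|\.)([^.]*)", s)
print(one_group)  # ['sub1', 'sub2', 'sub3']

# Two capture groups (assumed extension): each item is a tuple,
# and a group that did not take part in a match is an empty string.
rows = re.findall(r"(?:^([^@]*)@|\.)([^.]*)", s)
print(rows)       # [('name.surname', 'sub1'), ('', 'sub2'), ('', 'sub3')]

local_part = rows[0][0]
subdomains = [sub for _, sub in rows]
print(local_part, subdomains)  # name.surname ['sub1', 'sub2', 'sub3']
```

The two-group variant needs a small post-processing step, so re.split may still be the shorter route when both the local part and the subdomains are wanted.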

Obviously the email format can be more complicated than your example string.
