I am trying to extract the domains from this list of email hosts. Some entries include subdomains, which I want to remove. I only need the last two dot-separated parts, counting from the end (for example, abc.edu from 123.abc.edu).
import pandas as pd
ids = [1,2,3,4,5,6,7,8]
emails = ['gmail.com','aol.com','','123.abc.edu','123.er.abc.edu','','abc.gov','test.net']
df = pd.DataFrame({'ids':ids,'emails':emails})
df
ids emails
0 1 gmail.com
1 2 aol.com
2 3
3 4 123.abc.edu
4 5 123.er.abc.edu
5 6
6 7 abc.gov
7 8 test.net
I tried this, as well as variations with -1, 2:, and so on:
df.emails.str.split(".", 1).str[-1]
0 com
1 com
2
3 abc.edu
4 er.abc.edu
5
6 gov
7 net
I need output like this:
ids emails
0 1 gmail.com
1 2 aol.com
2 3
3 4 abc.edu
4 5 abc.edu
5 6
6 7 abc.gov
7 8 test.net
Passing 1 as the second argument to split() limits it to a single split, at the first dot, so everything after that dot stays together in one piece.
Use instead:
df.emails.str.split(".").str[-2:]
to obtain the last two segments of the split string:
0 [gmail, com]
1 [aol, com]
2 []
3 [abc, edu]
4 [abc, edu]
5 []
6 [abc, gov]
7 [test, net]
To get the output as a string including the dot, chain a method to join the previous output:
In []: df.emails.str.split(".").str[-2:].str.join(".")
Out[]:
0 gmail.com
1 aol.com
2
3 abc.edu
4 abc.edu
5
6 abc.gov
7 test.net
Name: emails, dtype: object
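To make the difference between the two calls concrete, here is a minimal plain-Python sketch of the same idea, assuming that Python's built-in str.split behaves like the pandas .str.split calls above for these inputs. The helper name last_two_labels is hypothetical and only used for illustration.

```python
def last_two_labels(host: str) -> str:
    """Return the last two dot-separated parts of host (hypothetical helper)."""
    # Split on every dot, keep the final two pieces, and glue them back together.
    # This mirrors df.emails.str.split(".").str[-2:].str.join(".")
    return ".".join(host.split(".")[-2:])


# With a split limit of 1, only the first dot is used as a separator,
# so the subdomains after it stay attached to the domain.
first_split = "123.er.abc.edu".split(".", 1)

hosts = ['gmail.com', 'aol.com', '', '123.abc.edu', '123.er.abc.edu', '', 'abc.gov', 'test.net']
domains = [last_two_labels(h) for h in hosts]

print(first_split)
print(domains)
```

Note that this approach is purely positional: a host such as example.co.uk would come back as co.uk, so it only fits data like the sample above where the domain is always the last two parts.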
You can also preprocess the emails list before building the DataFrame:
emails = ['gmail.com','aol.com','','123.abc.edu','123.er.abc.edu','','abc.gov','test.net']
emails_filtered = []
for email in emails:
    if '.' in email:
        # keep only the last two dot-separated parts, e.g. '123.abc.edu' -> 'abc.edu'
        emails_filtered.append('.'.join(email.split('.')[-2:]))
    else:
        # empty entries stay empty
        emails_filtered.append('')
df = pd.DataFrame({'ids':ids,'emails':emails_filtered})
Hope it helps.
Try this:
df.emails.str.split(".").str[-2:].str.join(sep='.')