
Modify data column with regex

I have a dataset called data. It has a column called networkDomain, and data['networkDomain'] looks like this:

0                amazonaws.com
1               vodafone-ip.de
2             ask4internet.com
3                   actcorp.in
4                    (not set)
5                    (not set)
6                   druknet.bt
7              unknown.unknown
8         alliancebroadband.in
9                  vsnl.net.in
10          grandenetworks.net
11             superonline.net
12                   (not set)
13             unknown.unknown
14             unknown.unknown
15                  fidnet.com
16                   (not set)
17             telepacific.net
18                    pldt.net
19        networkbackup.com.au

I would like to filter all the values ending with '.com' or '.net' using regex and assign all other values as 0.

I've tried data['networkDomain'][data['networkDomain'].str.contains(".com$|.net$", regex=True)] which returns:

0                  amazonaws.com
2               ask4internet.com
10            grandenetworks.net
11               superonline.net
15                    fidnet.com
17               telepacific.net
18                      pldt.net
22                       tdc.net
24                     qwest.net
26                     hinet.net
27                     ztomy.com
29                netvigator.com
30                    level3.net
31                   virginm.net
32                        rr.com
41                 sbcglobal.net
49                      pldt.net
51                  1asiacom.net
56                     yesup.net
59                 btireland.net
60                     avast.com

How can I set all the other values in data['networkDomain'], the ones that don't end in '.net' or '.com', to 0?

You can use Series.apply, which calls a function on each value of the column.

>>> import re
>>> import pandas as pd
>>> regex = re.compile(r"\.(?:com|net)$")  # escape the dot so it matches a literal "."
>>>
>>> def my_func(value):
...     if regex.search(value):
...         return value
...     return 0  # default for anything not ending in .com or .net
...
>>> df = pd.DataFrame(
...     [
...         {"Domain": " amazonaws.com"},
...         {"Domain": " amazonaws2.com"},
...         {"Domain": " amazonaws.net"},
...         {"Domain": "(not set)"},
...     ]
... )
>>>
>>> df["Domain"] = df["Domain"].apply(my_func)
>>> print(df)
            Domain
0    amazonaws.com
1   amazonaws2.com
2    amazonaws.net
3                0
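
The same result can probably be reached without a Python-level function by reusing the boolean mask from the question. The following is only a sketch: the sample values are copied from the question, the None entry is a hypothetical missing value added to show the na=False behaviour, and the names domains, keep and result are illustrative.

```python
import pandas as pd

# Sample values from the question; dtype=object because the result mixes str and int.
# The trailing None is a hypothetical missing value, not part of the original data.
domains = pd.Series(
    ["amazonaws.com", "vodafone-ip.de", "(not set)", "vsnl.net.in", "pldt.net", None],
    dtype=object,
)

# True where the value ends in ".com" or ".net"; na=False counts missing values as non-matching
keep = domains.str.contains(r"\.(?:com|net)$", regex=True, na=False)

# Keep matching values and replace everything else with 0
result = domains.where(keep, 0)
print(result.tolist())
```

Series.where keeps the entries where the mask is True and substitutes the second argument elsewhere, so the original column is left untouched until the result is assigned back.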

Find the rows that don't match the pattern and set their value to 0:

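import re

col = data.columns.get_loc('networkDomain')  # position of the column
for i, value in enumerate(data['networkDomain']):
    # i is a position, so assign with iloc; loc would only be right for a default 0..n-1 index
    if not re.search(r'\.(com|net)$', value):
        data.iloc[i, col] = 0
print(data)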
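
If the column should be updated in place, a boolean-mask assignment with .loc may be a shorter alternative to looping. This is a sketch under assumptions: the small frame below is hypothetical (values taken from the question, with a made-up non-default index to show that label-based assignment does not depend on row positions).

```python
import pandas as pd

# Hypothetical frame; the index labels 10..40 are invented for illustration
data = pd.DataFrame(
    {"networkDomain": ["fidnet.com", "actcorp.in", "unknown.unknown", "telepacific.net"]},
    index=[10, 20, 30, 40],
    dtype=object,
)

# Boolean Series aligned with data's index: True for values ending in .com or .net
matches = data["networkDomain"].str.contains(r"\.(?:com|net)$", na=False)

# Overwrite only the non-matching rows
data.loc[~matches, "networkDomain"] = 0
print(data["networkDomain"].tolist())
```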

Use Series.apply() to call a function on every value in the column. Note that the args argument must be passed as a tuple:

import re

from pandas import DataFrame

d = {'col': [1, 2, 3], 'col2': ['a.net', 2, 3]}
df = DataFrame(d)

def mask0(s, pattern):
    # Convert first so that non-string values (the ints in col2) can be matched
    s = str(s)
    if pattern.search(s):
        return s
    return 0

# Escape the dots and anchor at the end; [\.net|\.com] would be a character class
pat = re.compile(r'\.(?:net|com)$')
df['col2'] = df['col2'].apply(mask0, args=(pat,))

print(df)
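
A note on the patterns themselves, since two loose forms are easy to write for this task: an unescaped dot matches any character, and square brackets define a set of single characters rather than alternatives. The sketch below uses only the standard re module; the names loose, char_class and strict are illustrative, and "telecom" is a made-up test string.

```python
import re

loose = re.compile(r".com$|.net$")           # "." matches any character
char_class = re.compile(r".+[\.net|\.com]")  # brackets: one char from . n e t | c o m
strict = re.compile(r"\.(?:com|net)$")       # literal dot, then com or net, at the end

# "telecom" has no ".com" suffix, but "ecom" satisfies ".com$"
print(bool(loose.search("telecom")))
# "druknet.bt" ends in "t", which is in the character class
print(bool(char_class.match("druknet.bt")))
# The strict pattern rejects both and accepts a real .net domain
print(bool(strict.search("telecom")), bool(strict.search("druknet.bt")), bool(strict.search("pldt.net")))
```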
