How to extract only a specific part of url in Python and add its value as another column in df for every row?

I have a DataFrame df with a user column and a URL column that looks like this.

df

User      Url
1         http://www.mycompany.com/Overview/Get
2         http://www.mycompany.com/News
3         http://www.mycompany.com/Accountinfo
4         http://www.mycompany.com/Personalinformation/Index
...

I want to add another column, page, that holds only the first path segment of the URL (the part right after the domain), so the result would look like this.

User      Url                                                  page
1         http://www.mycompany.com/Overview/Get                Overview
2         http://www.mycompany.com/News                        News
3         http://www.mycompany.com/Accountinfo                 Accountinfo
4         http://www.mycompany.com/Personalinformation/Index   Personalinformation
...

My code below is not working.

slashparts = df['url'].split('/')
df['page'] = slashparts[4]

This is the error I'm getting:

  AttributeError                            Traceback (most recent call last)
  <ipython-input-23-0350a98a788c> in <module>()
  ----> 1 slashparts = df['request_url'].split('/')
        2 df['page'] = slashparts[1]

  ~\Anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   4370             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   4371                 return self[name]
  -> 4372             return object.__getattribute__(self, name)
   4373 
   4374     def __setattr__(self, name, value):

 AttributeError: 'Series' object has no attribute 'split'

A Series has no split method of its own, so use the pandas string methods through the .str accessor. After splitting, select the fourth element of each list with str[3], because Python counts from 0:

df['page'] = df['Url'].str.split('/').str[3]

Or, if performance is important, use a list comprehension:

df['page'] = [x.split('/')[3] for x in df['Url']]

print (df)
   User                                                Url  \
0     1              http://www.mycompany.com/Overview/Get   
1     2                      http://www.mycompany.com/News   
2     3               http://www.mycompany.com/Accountinfo   
3     4  http://www.mycompany.com/Personalinformation/I...   

                  page  
0             Overview  
1                 News  
2          Accountinfo  
3  Personalinformation  
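
For anyone who wants to run the two variants above end to end, here is a minimal sketch. The sample frame is rebuilt from the question, and the names via_str, via_list and short_page are only illustrative. It also shows one difference worth knowing: when a URL has no path at all, .str[3] yields NaN, whereas the list comprehension would raise an IndexError.

```python
import pandas as pd

# Sample frame rebuilt from the question; column names follow the answer above.
df = pd.DataFrame({
    "User": [1, 2, 3, 4],
    "Url": [
        "http://www.mycompany.com/Overview/Get",
        "http://www.mycompany.com/News",
        "http://www.mycompany.com/Accountinfo",
        "http://www.mycompany.com/Personalinformation/Index",
    ],
})

# Splitting on "/" gives ['http:', '', 'www.mycompany.com', 'Overview', 'Get'],
# so index 3 is the first path segment.
via_str = df["Url"].str.split("/").str[3]
via_list = [x.split("/")[3] for x in df["Url"]]

df["page"] = via_str

# A URL without any path has only three pieces after the split.
# The .str accessor returns NaN for the missing fourth element.
short_page = pd.Series(["http://www.mycompany.com"]).str.split("/").str[3]
```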

Here is an attempt to be a little more explicit, so that URLs with a missing http://, a missing www. and similar variations are handled as well:

# Raw string, so that \. and \w reach the regex engine as written
pat = r'(?:https?://)?(?:www\.)?(?:\w+\.\w+\/)([^/]*)'
df.assign(page=df.Url.str.extract(pat, expand=False))

   User                                                Url                 page
0     1              http://www.mycompany.com/Overview/Get             Overview
1     2                      http://www.mycompany.com/News                 News
2     3                      www.mycompany.com/Accountinfo          Accountinfo
3     1              http://www.mycompany.com/Overview/Get             Overview
4     2                                 mycompany.com/News                 News
5     3              https://www.mycompany.com/Accountinfo          Accountinfo
6     4  http://www.mycompany.com/Personalinformation/I...  Personalinformation
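
As an alternative to a hand-written pattern, the standard library's urllib.parse can do the URL parsing. This is only a sketch under assumptions: first_path_segment is a hypothetical helper name, not something from the answers above, and the "//" prefix trick assumes that scheme-less inputs always start with the host name.

```python
from urllib.parse import urlsplit


def first_path_segment(url):
    """Return the first non-empty path segment of url, or None if there is none."""
    # urlsplit only recognises the host when it is preceded by "//",
    # so add that marker for scheme-less inputs such as "mycompany.com/News".
    if "//" not in url:
        url = "//" + url
    segments = [part for part in urlsplit(url).path.split("/") if part]
    return segments[0] if segments else None


# The URL variations from the table above, plus one URL without a path.
urls = [
    "http://www.mycompany.com/Overview/Get",
    "www.mycompany.com/Accountinfo",
    "mycompany.com/News",
    "https://www.mycompany.com/Accountinfo",
    "http://www.mycompany.com",
]
pages = [first_path_segment(u) for u in urls]
```

On a DataFrame, the helper could be applied with df['Url'].map(first_path_segment). It will likely be slower than the vectorised .str methods, but it does not depend on how many dots the host name contains.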
