How to extract only a specific part of a url in Python and add its value as another column for each row in a df?

Question

I have a df containing a user column and a url column, and it looks like this.

df

User      Url
1         http://www.mycompany.com/Overview/Get
2         http://www.mycompany.com/News
3         http://www.mycompany.com/Accountinfo
4         http://www.mycompany.com/Personalinformation/Index
...

I want to add another column, page, that holds only the second part of the url (the first path segment after the domain), so that it looks like this.

User      Url                                                  page
1         http://www.mycompany.com/Overview/Get                Overview
2         http://www.mycompany.com/News                        News
3         http://www.mycompany.com/Accountinfo                 Accountinfo
4         http://www.mycompany.com/Personalinformation/Index   Personalinformation
...

My code below is not working.

slashparts = df['url'].split('/')
df['page'] = slashparts[4]

The error I'm getting:

  AttributeError                            Traceback (most recent call last)
  <ipython-input-23-0350a98a788c> in <module>()
  ----> 1 slashparts = df['request_url'].split('/')
        2 df['page'] = slashparts[1]

  ~\Anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   4370             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   4371                 return self[name]
  -> 4372             return object.__getattribute__(self, name)
   4373 
   4374     def __setattr__(self, name, value):

 AttributeError: 'Series' object has no attribute 'split'

Answer 1

Use the pandas text functions through the str accessor: split each url on '/' and then select the fourth element of each resulting list with str[3], because Python counts from 0:

df['page'] = df['Url'].str.split('/').str[3]

Or, if performance is important, use a list comprehension:

df['page'] = [x.split('/')[3] for x in df['Url']]

print(df)
   User                                                Url  \
0     1              http://www.mycompany.com/Overview/Get   
1     2                      http://www.mycompany.com/News   
2     3               http://www.mycompany.com/Accountinfo   
3     4  http://www.mycompany.com/Personalinformation/I...   

                  page  
0             Overview  
1                 News  
2          Accountinfo  
3  Personalinformation
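For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the sample frame from the question and applies both approaches. The `short` Series is a hypothetical extra case (a url with nothing after the domain) that is not in the original data; it is there only to show how the two approaches differ when index 3 does not exist.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'User': [1, 2, 3, 4],
    'Url': [
        'http://www.mycompany.com/Overview/Get',
        'http://www.mycompany.com/News',
        'http://www.mycompany.com/Accountinfo',
        'http://www.mycompany.com/Personalinformation/Index',
    ],
})

# Vectorised string methods: split on "/" and take element 3 of each list
df['page'] = df['Url'].str.split('/').str[3]

# The list comprehension builds the same values
page_lc = [x.split('/')[3] for x in df['Url']]

# Hypothetical edge case: a url with no path after the domain.
# .str[3] yields a missing value here, while the plain list
# comprehension would raise IndexError on the same input.
short = pd.Series(['http://www.mycompany.com'])
missing = short.str.split('/').str[3]
```

If the data may contain urls without a path, the str accessor version is the more forgiving of the two, since it produces a missing value instead of stopping with an exception.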

Answer 2

I'm attempting to be a little more explicit, to handle cases where http might be missing as well as other variations:

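# Raw string, so the regex escapes (\., \w) are not treated as Python string escapes
pat = r'(?:https?://)?(?:www\.)?(?:\w+\.\w+/)([^/]*)'
# assign returns a new DataFrame; the captured group becomes the page column
df.assign(page=df.Url.str.extract(pat, expand=False))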

   User                                                Url                 page
0     1              http://www.mycompany.com/Overview/Get             Overview
1     2                      http://www.mycompany.com/News                 News
2     3                      www.mycompany.com/Accountinfo          Accountinfo
3     1              http://www.mycompany.com/Overview/Get             Overview
4     2                                 mycompany.com/News                 News
5     3              https://www.mycompany.com/Accountinfo          Accountinfo
6     4  http://www.mycompany.com/Personalinformation/I...  Personalinformation
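As an alternative to a hand-written regex, the standard library's `urllib.parse.urlsplit` can do the url parsing. This is only a sketch of a different technique, not what the answer above uses: the helper name `first_path_segment` is made up for illustration, and the `'//'` prefix is a workaround I am assuming is acceptable, because `urlsplit` only recognises the host when the string contains `//`.

```python
from urllib.parse import urlsplit

import pandas as pd


def first_path_segment(url):
    """Return the first path segment of url, or None if there is none.

    Hypothetical helper for illustration; not part of pandas or urllib.
    """
    # Scheme-less inputs such as "mycompany.com/News" would otherwise be
    # parsed as a bare path, so prefix them to mark the host part.
    if '://' not in url and not url.startswith('//'):
        url = '//' + url
    path = urlsplit(url).path
    segment = path.strip('/').split('/')[0]
    return segment or None


# A few of the url variations shown in the output above
df = pd.DataFrame({
    'User': [1, 2, 3, 4],
    'Url': [
        'http://www.mycompany.com/Overview/Get',
        'www.mycompany.com/Accountinfo',
        'mycompany.com/News',
        'https://www.mycompany.com/Accountinfo',
    ],
})
df['page'] = df['Url'].map(first_path_segment)
```

This calls a Python function once per row, so it will likely be slower than `str.extract` on large frames, but it avoids having to encode assumptions about the shape of the domain in the pattern.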

Answer 1
3, accepted, 2018-08-29 12:45:53

Answer 2
2, 2018-08-29 12:58:39
