How to extract only a specific part of url in Python and add its value as another column in df for every row?
I have a df containing user and url, looking like this.
df
User Url
1 http://www.mycompany.com/Overview/Get
2 http://www.mycompany.com/News
3 http://www.mycompany.com/Accountinfo
4 http://www.mycompany.com/Personalinformation/Index
...
I want to add another column, page, that only takes the second part of the url, so I'd have it like this.
user url page
1 http://www.mycompany.com/Overview/Get Overview
2 http://www.mycompany.com/News News
3 http://www.mycompany.com/Accountinfo Accountinfo
4 http://www.mycompany.com/Personalinformation/Index Personalinformation
...
My code below is not working.
slashparts = df['url'].split('/')
df['page'] = slashparts[4]
The error I'm getting:
AttributeError Traceback (most recent call last)
<ipython-input-23-0350a98a788c> in <module>()
----> 1 slashparts = df['request_url'].split('/')
2 df['page'] = slashparts[1]
~\Anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
4370 if self._info_axis._can_hold_identifiers_and_holds_name(name):
4371 return self[name]
-> 4372 return object.__getattribute__(self, name)
4373
4374 def __setattr__(self, name, value):
AttributeError: 'Series' object has no attribute 'split'
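The error can be reproduced in a few lines, which may help show where it comes from. This is a minimal sketch that assumes a two-row DataFrame built from the sample rows, with the column named Url as in the df shown in the question: split is a method of Python strings, not of a pandas Series, so the call fails before any indexing happens.

```python
import pandas as pd

# Two sample rows copied from the question (column name 'Url' is assumed
# from the df shown there).
df = pd.DataFrame({
    "User": [1, 2],
    "Url": [
        "http://www.mycompany.com/Overview/Get",
        "http://www.mycompany.com/News",
    ],
})

# A plain Python string has .split, but a Series does not expose it directly.
try:
    df["Url"].split("/")
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

print(error_message)

# Element-wise string methods on a Series are reached through the .str accessor.
parts = df["Url"].str.split("/")
print(parts.iloc[0])
```

Each element of parts is an ordinary Python list, so the page name still has to be picked out of every list separately.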
Use pandas text functions through the str accessor, and to select the 4th element of each split list use str[3], because Python counts from 0:
df['page'] = df['Url'].str.split('/').str[3]
Or, if performance is important, use a list comprehension:
df['page'] = [x.split('/')[3] for x in df['Url']]
print (df)
User Url \
0 1 http://www.mycompany.com/Overview/Get
1 2 http://www.mycompany.com/News
2 3 http://www.mycompany.com/Accountinfo
3 4 http://www.mycompany.com/Personalinformation/I...
page
0 Overview
1 News
2 Accountinfo
3 Personalinformation
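For anyone who wants to run both variants side by side, here is a self-contained sketch. It assumes the four sample rows from the question; the names via_accessor, via_listcomp and short are only illustrative. It also shows one difference that may matter in practice: for a URL with no path at all, str[3] yields a missing value, whereas the list comprehension would raise IndexError.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "User": [1, 2, 3, 4],
    "Url": [
        "http://www.mycompany.com/Overview/Get",
        "http://www.mycompany.com/News",
        "http://www.mycompany.com/Accountinfo",
        "http://www.mycompany.com/Personalinformation/Index",
    ],
})

# 'http://host/page'.split('/') gives ['http:', '', 'host', 'page'],
# so the page name sits at index 3.
via_accessor = df["Url"].str.split("/").str[3]
via_listcomp = [x.split("/")[3] for x in df["Url"]]

# Both approaches produce the same values on this data.
assert via_accessor.tolist() == via_listcomp

df["page"] = via_accessor
print(df[["User", "page"]])

# Hypothetical edge case (not in the question): a URL without any path.
# The accessor returns a missing value instead of raising IndexError.
short = pd.Series(["http://www.mycompany.com"])
missing = short.str.split("/").str[3]
print(missing.isna().all())
```

Note that index 3 is only correct when every URL starts with a scheme such as http://, since the scheme contributes the first two list elements.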
I'm attempting to be a little more explicit, to handle cases where http might be missing, and other variations:
pat = r'(?:https?://)?(?:www\.)?(?:\w+\.\w+\/)([^/]*)'
df.assign(page=df.Url.str.extract(pat, expand=False))
User Url page
0 1 http://www.mycompany.com/Overview/Get Overview
1 2 http://www.mycompany.com/News News
2 3 www.mycompany.com/Accountinfo Accountinfo
3 1 http://www.mycompany.com/Overview/Get Overview
4 2 mycompany.com/News News
5 3 https://www.mycompany.com/Accountinfo Accountinfo
6 4 http://www.mycompany.com/Personalinformation/I... Personalinformation
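The regex approach above can be tried end to end with the sketch below. It rebuilds the seven rows from the printed output; the last URL is truncated in that display, so its full form is assumed to be the one given in the question. The pattern is written as a raw string so that Python does not try to interpret the backslashes itself.

```python
import pandas as pd

# Rows taken from the output above. The last URL is shown truncated there;
# the full value is assumed to match the one in the question's df.
df = pd.DataFrame({
    "User": [1, 2, 3, 1, 2, 3, 4],
    "Url": [
        "http://www.mycompany.com/Overview/Get",
        "http://www.mycompany.com/News",
        "www.mycompany.com/Accountinfo",
        "http://www.mycompany.com/Overview/Get",
        "mycompany.com/News",
        "https://www.mycompany.com/Accountinfo",
        "http://www.mycompany.com/Personalinformation/Index",
    ],
})

# Optional scheme, optional 'www.', then 'name.tld/', then capture
# everything up to the next slash.
pat = r"(?:https?://)?(?:www\.)?(?:\w+\.\w+/)([^/]*)"

# With a single capture group and expand=False, str.extract returns a Series.
result = df.assign(page=df["Url"].str.extract(pat, expand=False))
print(result["page"].tolist())
```

assign returns a new DataFrame and leaves df unchanged, so the result has to be kept (here in result) or assigned back to df.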