Regular expression for splitting on slashes

Question

I am trying to split URLs to get the domain name.

example.com                => example.com
example.com/dir/index.html => example.com

The regular expression I am trying to use is

(.+?)(/|$)

When I use it in Python like this:

import re
m = re.search('(.+?)(/|$)', url)

It works for the first one, but for the second example I always get example.com/ . How do I get rid of the trailing slash?

Edit: I am very sorry, I forgot to include one important piece of information. I need a regular expression because I need to write this in Oracle SQL. Fortunately, Oracle supports regex, but nothing like urlparse. I am just using Python for testing. Sorry about that!

Answer 1

The easy way to do this is to use the urlparse function in the stdlib:

>>> from urllib.parse import urlparse
>>> url = 'http://example.com/dir/index.html'
>>> p = urlparse(url)
>>> p.netloc
'example.com'

Besides being a whole lot simpler, it handles cases that you haven't thought of in a well-defined and clearly documented way (e.g., what if there's a port as well as a host?), whereas with your code, who knows what happens with any cases you didn't anticipate?
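One case worth seeing concretely: urlparse only fills in netloc when the string contains a "//" authority marker, so the scheme-less inputs from the question come back with an empty netloc. The sketch below is one possible workaround, not part of the stdlib; extract_host is a hypothetical helper name, and it assumes that an input without "://" is a bare host followed by an optional path.

```python
from urllib.parse import urlparse

def extract_host(url):
    """Return the host of a URL, with or without a scheme (hypothetical helper)."""
    # Assumption: no "://" means a bare "host/path" string, so prepend "//"
    # to make urlparse treat the first component as the network location.
    if '://' not in url:
        url = '//' + url
    # .hostname drops any port and lowercases the host name
    return urlparse(url).hostname

print(extract_host('example.com/dir/index.html'))
print(extract_host('http://example.com:8080/dir/index.html'))
```

Both calls should print example.com; use .netloc instead of .hostname if the port needs to be kept.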

If you really want to treat the URL as a string instead of a URL, the easy way to split on slashes is to split on slashes:

>>> bits = url.split('/')
>>> bits[2]
'example.com'
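Note that bits[2] is the host only because the URL above starts with http://. For the scheme-less inputs in the question, the host is the first piece instead. A minimal sketch, assuming the input never carries a scheme (first_segment is a hypothetical name):

```python
def first_segment(url):
    """Return everything before the first slash (hypothetical helper)."""
    # partition returns (head, separator, tail); when there is no slash,
    # head is the whole string and the other two parts are empty.
    head, _sep, _tail = url.partition('/')
    return head

print(first_segment('example.com'))
print(first_segment('example.com/dir/index.html'))
```

str.partition is used here rather than split because it stops at the first slash and never raises on input without one.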

If you really want to use regexps to split on slashes, you could use re.split instead of trying to figure out a way to trick re.search into splitting for you:

>>> bits = re.split('/', url)
>>> bits[2]
'example.com'

Finally, if you want to do it with match or search, and you don't want to capture the / , don't put the / in a capturing group, and look at the group you went out of your way to capture instead of at the whole string:

>>> url = 'example.com/dir/index.html'
>>> m = re.search('(.+?)(/|$)', url)
>>> m.groups()
('example.com', '/')
>>> m = re.search('(.+?)(?:/|$)', url)
>>> m.groups()
('example.com',)
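The trailing slash in the question most likely comes from reading the whole match rather than the first group. The sketch below compares the two on both inputs from the question; match_parts is a hypothetical helper written for this illustration.

```python
import re

PATTERN = re.compile(r'(.+?)(/|$)')

def match_parts(url):
    """Return (whole match, first group) for the question's pattern."""
    m = PATTERN.search(url)
    # group(0) is the entire match, including the slash consumed by group 2
    # group(1) is only the lazily matched text before the slash or end of string
    return m.group(0), m.group(1)

for url in ('example.com', 'example.com/dir/index.html'):
    print(match_parts(url))
```

For the second input the whole match is 'example.com/' while group 1 is 'example.com', so the original pattern already works if group 1 is the value read.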

Answer 2

Try matching non-forward-slash characters, like ([^/]+?)(/|$)
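Taken one step further, the negated character class can stand on its own, with no groups and no lazy quantifier. That may be useful if the target regex dialect lacks some of those constructs; whether a given database engine accepts a particular construct should be checked against its own documentation. A minimal sketch, assuming scheme-less input as in the question (host_prefix is a hypothetical name):

```python
import re

def host_prefix(url):
    """Return the leading run of non-slash characters (hypothetical helper)."""
    # [^/]+ is greedy but cannot cross a slash, so the match stops by itself
    # at the first slash or at the end of the string.
    m = re.match(r'[^/]+', url)
    return m.group(0) if m else None

print(host_prefix('example.com'))
print(host_prefix('example.com/dir/index.html'))
```

With a scheme present (http://example.com/...), this pattern would return 'http:' instead, so the scheme would have to be stripped first.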

Answer 1
2 2014-01-09 01:48:12

Answer 2
0 2014-01-09 01:44:47