How do I split a URL string into separate parts in Python?

Question

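I decided to learn Python tonight :) I know C quite well (I wrote an OS in it), so I'm no programming newbie and everything in Python seems pretty easy, but I don't know how to solve this problem: say I have this address: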

http://example.com/random/folder/path.html

How can I create two strings from it? One should contain the "base" name of the server, so in this example it would be

http://example.com/

and the other should contain everything except the final filename, so in this example it would be

http://example.com/random/folder/

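I know I could find the third and the last slash respectively, but is there a better way?

It would also be nice to have the trailing slash in both cases, but I don't mind if not, since it can easily be added. So is there a good, fast, efficient solution for this? Or is there only "my" solution of finding the slashes?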

Answer 1

The urlparse module in Python 2.x (or urllib.parse in Python 3.x) is the way to do this.

>>> from urllib.parse import urlparse
>>> url = 'http://example.com/random/folder/path.html'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'example.com'
>>> parse_object.path
'/random/folder/path.html'
>>> parse_object.scheme
'http'
>>>

If you want to do more work on the path of the file under the URL, you can use the posixpath module:

>>> from posixpath import basename, dirname
>>> basename(parse_object.path)
'path.html'
>>> dirname(parse_object.path)
'/random/folder'

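Afterwards, you can use posixpath.join to glue the parts back together.

Note: Windows users will choke on the path separator in os.path. The posixpath module documentation has a special reference to URL manipulation, so all is good.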
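To tie this back to the question, here is a minimal sketch (Python 3) that builds both requested strings from the parsed pieces. The helper name split_url is made up for illustration, and dropping any query string or fragment is an assumption about what is wanted.

```python
from posixpath import dirname
from urllib.parse import urlparse, urlunparse


def split_url(url):
    """Return (base, directory) for a URL, both with a trailing slash.

    Hypothetical helper for illustration; query and fragment are dropped.
    """
    parts = urlparse(url)

    # Scheme + host only, with "/" as the path.
    base = urlunparse((parts.scheme, parts.netloc, '/', '', '', ''))

    # Everything up to, but not including, the last path component.
    folder = dirname(parts.path)
    if not folder.endswith('/'):
        folder += '/'
    directory = urlunparse((parts.scheme, parts.netloc, folder, '', '', ''))

    return base, directory


base, directory = split_url('http://example.com/random/folder/path.html')
print(base)       # http://example.com/
print(directory)  # http://example.com/random/folder/
```

Using posixpath rather than os.path keeps the separator a forward slash on every platform, which is the point of the note above.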

Answer 2

If this is the extent of your URL parsing, Python's built-in rpartition will do the job:

>>> URL = "http://example.com/random/folder/path.html"
>>> Segments = URL.rpartition('/')
>>> Segments[0]
'http://example.com/random/folder'
>>> Segments[2]
'path.html'

From Pydoc, str.rpartition:

Splits the string at the last occurrence of sep, and returns a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing two empty strings, followed by the string itself

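This means that rpartition does the searching for you and splits the string at the last (rightmost) occurrence of the character you specify (in this case /). It returns a tuple containing: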

(everything to the left of char , the character itself , everything to the right of char)
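As a rough sketch of how the 3-tuple can be used, unpacking it gives the folder and the filename directly, and gluing the separator back on restores the trailing slash. The last two calls show the edge cases: a string with no slash, and a URL with no path, where the rightmost slash belongs to the "http://" prefix and the split is probably not what you want.

```python
url = "http://example.com/random/folder/path.html"

# Unpack the 3-tuple: (before, separator, after).
head, sep, tail = url.rpartition("/")
folder = head + sep
print(folder)  # http://example.com/random/folder/
print(tail)    # path.html

# Separator not found: two empty strings, then the original string.
no_slash = "path.html".rpartition("/")
print(no_slash)  # ('', '', 'path.html')

# Caveat: with no path, the last slash is part of "http://".
bare = "http://example.com".rpartition("/")
print(bare)  # ('http:/', '/', 'example.com')
```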

Answer 3

I have no experience with Python, but I found the urlparse module, which should do the job.

Answer 4

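In Python, a lot of operations are done using lists. The urlparse module mentioned by Sebasian Dietz may well solve your specific problem, but if you're generally interested in Pythonic ways to, for example, find slashes in strings, try something like this: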

url = 'http://example.com/random/folder/path.html'

# Create a list of each bit between slashes
slashparts = url.split('/')

# Now join back the first three sections 'http:', '' and 'example.com'
basename = '/'.join(slashparts[:3]) + '/'

# All except the last one
dirname = '/'.join(slashparts[:-1]) + '/'

print('slashparts = %s' % slashparts)
print('basename = %s' % basename)
print('dirname = %s' % dirname)

The output of this program is:

slashparts = ['http:', '', 'example.com', 'random', 'folder', 'path.html']
basename = http://example.com/
dirname = http://example.com/random/folder/

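The interesting bits are split, join, the slice notation array[A:B] (including negative numbers for offsets from the end) and, as a bonus, the % operator on strings to give printf-style formatting.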
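A small sketch of those pieces in isolation, written for Python 3. The second half uses the optional maxsplit argument of split and rsplit, which the program above does not use; it is shown here only as a possible shortcut and assumes the URL contains at least three slashes.

```python
url = 'http://example.com/random/folder/path.html'
parts = url.split('/')

# Slices: the first three items, the last item, and all but the last.
print(parts[:3])   # ['http:', '', 'example.com']
print(parts[-1])   # path.html
print(parts[:-1])  # ['http:', '', 'example.com', 'random', 'folder']

# maxsplit stops splitting early, so the tail stays in one piece.
scheme, _, host, rest = url.split('/', 3)
basename = '%s//%s/' % (scheme, host)

# rsplit with maxsplit=1 cuts only at the rightmost slash.
dirname = url.rsplit('/', 1)[0] + '/'

print('basename = %s' % basename)  # basename = http://example.com/
print('dirname = %s' % dirname)    # dirname = http://example.com/random/folder/
```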

Answer 5

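The posixpath module mentioned in sykora's answer does not seem to be available in my Python setup (Python 2.7.3).

According to this article, it seems that the "proper" way to do this would be to use...

urlparse.urlparse and urlparse.urlunparse can be used to detach and reattach the base of the URL
the functions of os.path can be used to manipulate the path
urllib.url2pathname and urllib.pathname2url (to make the path name manipulation portable, so it can work on Windows and the like)

So, for example (not including reattaching the base URL)...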

>>> import urlparse, urllib, os.path
>>> os.path.dirname(urllib.url2pathname(urlparse.urlparse("http://example.com/random/folder/path.html").path))
'/random/folder'
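The reattaching step left out above could look roughly like this in Python 3. Note the swaps: this sketch uses urllib.parse instead of the Python 2 urlparse module, and posixpath instead of os.path with url2pathname, so the result does not depend on the platform. _replace is the standard namedtuple method available on the parse result, and the sample URL's query string and fragment are added here only to show that they get dropped.

```python
import posixpath
from urllib.parse import urlparse

# Sample URL; the "?x=1#top" suffix is an illustrative addition.
url = "http://example.com/random/folder/path.html?x=1#top"
parts = urlparse(url)

# Directory part of the path, normalized to end in exactly one slash.
folder = posixpath.dirname(parts.path).rstrip("/") + "/"

# Swap in the new path, clear the rest, and rebuild the URL.
directory_url = parts._replace(
    path=folder, params="", query="", fragment=""
).geturl()
base_url = parts._replace(
    path="/", params="", query="", fragment=""
).geturl()

print(base_url)       # http://example.com/
print(directory_url)  # http://example.com/random/folder/
```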

Answer 6

You can use the Python library furl:

import furl

f = furl.furl("http://example.com/random/folder/path.html")
print(str(f.path))  # '/random/folder/path.html'
print(str(f.path).split("/")) # ['', 'random', 'folder', 'path.html']

To access the word after the first "/", use:

str(f.path).split("/")[1] # 'random'
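If installing a third-party library is not an option, similar path segments can be had from the standard library. This is a sketch that swaps in pathlib.PurePosixPath together with urllib.parse; it is not part of furl and is only meant to show the same idea.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

url = "http://example.com/random/folder/path.html"

# Treat the URL path as a POSIX-style path, regardless of platform.
path = PurePosixPath(urlparse(url).path)

print(path.parts)        # ('/', 'random', 'folder', 'path.html')
print(path.parts[1])     # random
print(path.name)         # path.html
print(str(path.parent))  # /random/folder
```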


Solution 1
56 2009-01-16 08:14:36

Solution 2
12 2009-01-16 08:11:11

Solution 3
10 2009-01-16 07:49:55

Solution 4
8 2009-01-16 08:08:32

Solution 5
2 2013-02-06 05:35:32

Solution 6
1 2016-12-02 15:58:06
