
How can I split a URL string up into separate parts in Python?

I decided that I'll learn Python tonight :) I know C pretty well (wrote an OS in it), so I'm not a noob in programming, and everything in Python seems pretty easy, but I don't know how to solve this problem. Let's say I have this address:

http://example.com/random/folder/path.html

Now how can I create two strings from this, one containing the "base" name of the server, so in this example it would be

http://example.com/

and another containing everything except the last filename, so in this example it would be

http://example.com/random/folder/

I know, of course, that I could just find the third and the last slash respectively, but is there a better way?

It would also be cool to have the trailing slash in both cases, but I don't care since it can be added easily. So is there a good, fast, effective solution for this? Or is there only "my" solution, finding the slashes?

The urlparse module in Python 2.x (or urllib.parse in Python 3.x) would be the way to do it.

>>> from urllib.parse import urlparse
>>> url = 'http://example.com/random/folder/path.html'
>>> parse_object = urlparse(url)
>>> parse_object.netloc
'example.com'
>>> parse_object.path
'/random/folder/path.html'
>>> parse_object.scheme
'http'
>>>

If you want to do more work on the path of the file under the URL, you can use the posixpath module:

>>> from posixpath import basename, dirname
>>> basename(parse_object.path)
'path.html'
>>> dirname(parse_object.path)
'/random/folder'

After that, you can use posixpath.join to glue the parts together.

Note: os.path will choke on Windows, where the path separator is a backslash rather than the forward slash used in URLs. The posixpath module documentation makes special reference to URL manipulation, so all's good.
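As a rough sketch of how the pieces above could be combined to get the two strings from the question (Python 3 module names; the file name 'other.html' is made up purely for illustration):

```python
import posixpath
from urllib.parse import urlparse

url = 'http://example.com/random/folder/path.html'
parts = urlparse(url)

# Base of the server: scheme and host, with a trailing slash.
base = '%s://%s/' % (parts.scheme, parts.netloc)

# Directory of the path; joining with '' appends the trailing slash.
directory = posixpath.join(posixpath.dirname(parts.path), '')
folder_url = '%s://%s%s' % (parts.scheme, parts.netloc, directory)

# posixpath.join can also glue a different file name onto the directory.
# 'other.html' is a hypothetical name, not something from the question.
sibling = posixpath.join(directory, 'other.html')

print(base)        # http://example.com/
print(folder_url)  # http://example.com/random/folder/
print(sibling)     # /random/folder/other.html
```

This ignores any query string or fragment, which the example URL does not have.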

If this is the extent of your URL parsing, Python's built-in rpartition will do the job:

>>> URL = "http://example.com/random/folder/path.html"
>>> Segments = URL.rpartition('/')
>>> Segments[0]
'http://example.com/random/folder'
>>> Segments[2]
'path.html'

From Pydoc, str.rpartition:

Splits the string at the last occurrence of sep, and returns a 3-tuple containing the part before the separator, the separator itself, and the part after the separator. If the separator is not found, return a 3-tuple containing two empty strings, followed by the string itself

What this means is that rpartition does the searching for you, and splits the string at the last (rightmost) occurrence of the character you specify (in this case /). It returns a tuple containing:

(everything to the left of char , the character itself , everything to the right of char)
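A small sketch of that behaviour, including two edge cases worth knowing about before relying on it. The inputs other than the question's URL are made up for illustration:

```python
# str.rpartition always returns a 3-tuple: (head, separator, tail).
url = 'http://example.com/random/folder/path.html'
head, sep, tail = url.rpartition('/')
print(head + sep)  # http://example.com/random/folder/

# With no path at all, the last "/" belongs to "//", so the split is misleading.
bare = 'http://example.com'
print(bare.rpartition('/'))  # ('http:/', '/', 'example.com')

# When the separator is absent, the whole string lands in the last slot.
print('path.html'.rpartition('/'))  # ('', '', 'path.html')
```

So rpartition is fine when the URL is known to contain a path, but it does not understand URL structure.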

I have no experience with Python, but I found the urlparse module, which should do the job.

In Python, a lot of operations are done using lists. The urlparse module mentioned by Sebasian Dietz may well solve your specific problem, but if you're generally interested in Pythonic ways to find slashes in strings, for example, try something like this:

url = 'http://example.com/random/folder/path.html'

# Create a list of each bit between slashes
slashparts = url.split('/')

# Now join back the first three sections 'http:', '' and 'example.com'
basename = '/'.join(slashparts[:3]) + '/'

# All except the last one
dirname = '/'.join(slashparts[:-1]) + '/'

# Parenthesised so the lines work as print calls in Python 3 as well
print('slashparts = %s' % slashparts)
print('basename = %s' % basename)
print('dirname = %s' % dirname)

The output of this program is this:

slashparts = ['http:', '', 'example.com', 'random', 'folder', 'path.html']
basename = http://example.com/
dirname = http://example.com/random/folder/

The interesting bits are split, join, the slice notation array[A:B] (including negatives for offsets-from-the-end) and, as a bonus, the % operator on strings to give printf-style formatting.
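For anyone new to those pieces, here is a short illustration of each one on the same URL. It is only a sketch, and the variable name parts is arbitrary:

```python
parts = 'http://example.com/random/folder/path.html'.split('/')

# Slices: [:3] is the first three items, [3:-1] skips the last one.
print(parts[:3])    # ['http:', '', 'example.com']
print(parts[3:-1])  # ['random', 'folder']
print(parts[-1])    # path.html

# join is the inverse of split for the same separator.
print('/'.join(parts[:3]))  # http://example.com

# printf-style formatting with more than one value takes a tuple.
print('%s has %d parts' % (parts[2], len(parts)))  # example.com has 6 parts
```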

It seems like the posixpath module mentioned in sykora's answer is not available in my Python setup (Python 2.7.3).

As per this article, it seems that the "proper" way to do this would be using...

  • urlparse.urlparse and urlparse.urlunparse can be used to detach and reattach the base of the URL
  • The functions of os.path can be used to manipulate the path
  • urllib.url2pathname and urllib.pathname2url (to make path name manipulation portable, so it can work on Windows and the like)

So for example (not including reattaching the base URL)...

>>> import urlparse, urllib, os.path
>>> os.path.dirname(urllib.url2pathname(urlparse.urlparse("http://example.com/random/folder/path.html").path))
'/random/folder'
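The reattaching step left out above could look roughly like this. Note that this sketch uses the Python 3 module name (urllib.parse rather than urlparse) and posixpath for the path handling, so it is an adaptation rather than the exact recipe from the article:

```python
import posixpath
from urllib.parse import urlparse, urlunparse

url = 'http://example.com/random/folder/path.html'
parts = urlparse(url)

# urlparse returns a named tuple, so _replace gives a modified copy
# with only the path swapped for its directory part.
dir_parts = parts._replace(path=posixpath.dirname(parts.path) + '/')
dir_url = urlunparse(dir_parts)

# The base is the same tuple with the path reduced to "/".
base_url = urlunparse(parts._replace(path='/'))

print(dir_url)   # http://example.com/random/folder/
print(base_url)  # http://example.com/
```

Because the other components are carried over untouched, a query string or fragment on the original URL would be kept unless it is cleared in the same _replace call.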

You can use the third-party Python library furl:

import furl

f = furl.furl("http://example.com/random/folder/path.html")
print(str(f.path))  # '/random/folder/path.html'
print(str(f.path).split("/")) # ['', 'random', 'folder', 'path.html']

To access the word after the first "/", use:

str(f.path).split("/")[1] # 'random'
