Splitting a string URL into words using Python

Question

How can I get the individual words out of a string (a URL) in Python? For example, from a URL like this:

http://www.sample.com/level1/level2/index.html?id=1234

I would like to get these words:

http, www, sample, com, level1, level2, index, html, id, 1234

Any solution using Python is welcome.

Thanks.

Answer 1

Here is what you can do for any URL:

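import re

def getWordsFromURL(url):
    # Split on runs of common URL separators; '.' is included so that
    # "www.sample.com" and "index.html" are broken into separate words.
    return re.compile(r'[:/?=\-&.]+', re.UNICODE).split(url)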

You can now use it like this:

url = "http://www.sample.com/level1/level2/index.html?id=1234"
words = getWordsFromURL(url)
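
One thing to watch for: re.split returns empty strings when the URL starts or ends with a separator, for instance with a trailing slash. The following is only a minimal sketch of one way to filter them out. The helper name split_url_words and the exact separator set are assumptions made for illustration, not part of the answer above.

```python
import re

# Assumed separator set for illustration; extend it if your URLs need more.
_SEPARATORS = re.compile(r"[:/?=\-&.]+")

def split_url_words(url):
    """Split url on runs of separator characters and drop empty pieces."""
    return [word for word in _SEPARATORS.split(url) if word]

# The trailing slash would otherwise produce a final empty string.
print(split_url_words("http://www.sample.com/level1/level2/"))
```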

Answer 2

Just do a regex split on maximal runs of non-alphanumeric characters:

import re
l = re.split(r"\W+","http://www.sample.com/level1/level2/index.html?id=1234")
print(l)

This yields:

['http', 'www', 'sample', 'com', 'level1', 'level2', 'index', 'html', 'id', '1234']

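This is simple but, as someone pointed out, it does not behave well when names in the URL contain characters such as _ or -: \W+ splits on - and never splits on _. The less fun solution is therefore to list every token that can separate parts of the URL: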

l = re.split(r"[/:\.?=&]+","http://stackoverflow.com/questions/41935748/splitting-a-string-url-into-words-using-python")

(I admit I may have forgotten some separator characters.)
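
To make the difference between the two patterns concrete, here is a small comparison sketch. The URL is made up for illustration, and the separator list is the same one as above, so it may still be missing some characters.

```python
import re

# Made-up URL containing both a hyphen and an underscore.
url = "http://www.sample.com/my-page/user_name.html"

# \W+ splits on '-' but not on '_', because '_' counts as a word character.
by_non_word = re.split(r"\W+", url)

# An explicit separator list leaves both '-' and '_' inside the words.
by_separators = re.split(r"[/:.?=&]+", url)

print(by_non_word)
print(by_separators)
```

Which behaviour is preferable depends on whether hyphenated or underscored names should count as one word or several.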