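Extracting directory structure from URLs
I'd like to extract the directory hierarchy from website URLs. Not every website follows a directory structure. For sites that do (below), I'd like to be able to build a Python dictionary that mirrors the directory hierarchy (below). How can I write a Python script that extracts the structure in the URLs into a dictionary?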
Raw data:
http://www.ex.com
http://www.ex.com/product_cat_1/
http://www.ex.com/product_cat_1/item_1
http://www.ex.com/product_cat_1/item_2
http://www.ex.com/product_cat_2/
http://www.ex.com/product_cat_2/item_1
http://www.ex.com/product_cat_2/item_2
http://www.ex.com/terms_and_conditions/
http://www.ex.com/Media_Center
Example output:
{'url':'http://www.ex.com', 'sub_dir':[
    {'url':'http://www.ex.com/product_cat_1/', 'sub_dir':[
        {'url':'http://www.ex.com/product_cat_1/item_1'},
        {'url':'http://www.ex.com/product_cat_1/item_2'}]},
    {'url':'http://www.ex.com/product_cat_2/', 'sub_dir':[
        {'url':'http://www.ex.com/product_cat_2/item_1'},
        {'url':'http://www.ex.com/product_cat_2/item_2'}]},
    {'url':'http://www.ex.com/terms_and_conditions/'},
    {'url':'http://www.ex.com/Media_Center'},
]}
For each item:
    if it is a subdir of something else:
        add it to the subdirectory list of that item
    otherwise:
        add it to the main list.
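The steps above could be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the function name `group_urls` is made up here, a URL counts as a directory only if it ends with "/", and the shortest URL (first after sorting) is treated as the root.

```python
def group_urls(urls):
    """Build {'url': ..., 'sub_dir': [...]} nodes from a flat list of URLs."""
    urls = sorted(urls)          # a prefix always sorts before its extensions
    root = {"url": urls[0]}      # assumption: the first sorted URL is the site root
    nodes = []                   # every non-root node created so far
    for url in urls[1:]:
        # "If it is a subdir of something else": look for directory URLs
        # (ending in "/") that are a prefix of this URL.
        parents = [n for n in nodes
                   if n["url"].endswith("/") and url.startswith(n["url"])]
        # Pick the deepest such parent; "otherwise" fall back to the main list.
        parent = max(parents, key=lambda n: len(n["url"]), default=root)
        child = {"url": url}
        parent.setdefault("sub_dir", []).append(child)
        nodes.append(child)
    return root


example_urls = [
    "http://www.ex.com",
    "http://www.ex.com/product_cat_1/",
    "http://www.ex.com/product_cat_1/item_1",
    "http://www.ex.com/product_cat_1/item_2",
    "http://www.ex.com/terms_and_conditions/",
    "http://www.ex.com/Media_Center",
]
tree = group_urls(example_urls)
```

Scanning all earlier nodes for each URL is quadratic, which is fine for a short list but would be worth replacing with a lookup table for large inputs.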
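Here is a solution with a slightly different output format. First, the directory structure can be represented as nested dicts instead of using sub_dir and url keys, where an empty dict is an empty directory or a file (a leaf in the tree).
For example, the input strings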
"www.foo.com/images/hellokitty.jpg"
"www.foo.com/images/t-rex.jpg"
"www.foo.com/videos/"
would map to a directory structure like this:
{
    "www.foo.com": {
        "images": {
            "hellokitty.jpg": {},
            "t-rex.jpg": {}
        },
        "videos": {}
    }
}
With this model, parsing the data strings is a simple combination of for loops, an if statement, and a few string functions.
Code:
raw_data = [
    "http://www.ex.com",
    "http://www.ex.com/product_cat_1/",
    "http://www.ex.com/product_cat_1/item_1",
    "http://www.ex.com/product_cat_1/item_2",
    "http://www.ex.com/product_cat_2/",
    "http://www.ex.com/product_cat_2/item_1",
    "http://www.ex.com/product_cat_2/item_2",
    "http://www.ex.com/terms_and_conditions/",
    "http://www.ex.com/Media_Center"
]

root = {}
for url in raw_data:
    last_dir = root
    # Drop the scheme (if any) and the trailing slash, then split into components.
    for dir_name in url.split("://", 1)[-1].rstrip("/").split("/"):
        # Create the directory if it is missing, then always descend into it,
        # so the result does not depend on parents being listed first.
        if dir_name not in last_dir:
            last_dir[dir_name] = {}
        last_dir = last_dir[dir_name]
Output:
{
    "www.ex.com": {
        "Media_Center": {},
        "terms_and_conditions": {},
        "product_cat_1": {
            "item_2": {},
            "item_1": {}
        },
        "product_cat_2": {
            "item_2": {},
            "item_1": {}
        }
    }
}
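Here is a script that directly produces the requested output (note that it reads its input from a file; pass the file name as the script's first and only command-line argument). Note: using Butch's solution and then converting to this format would probably be cleaner and faster.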
#!/usr/bin/env python3
import sys
from urllib.parse import urlparse


def match(init, path):
    # True if `init` is a directory URL (ends with "/") that is a prefix of `path`.
    return init.endswith("/") and path.startswith(init)


def add_url(tree, url):
    # Descend into the first matching subdirectory until none matches,
    # then attach `url` at that level.
    while True:
        if tree["url"] == url:
            return  # already present
        matches = [t for t in tree.get("sub_dir", []) if match(t["url"], url)]
        if matches:
            tree = matches[0]
            continue
        tree.setdefault("sub_dir", []).append({"url": url})
        return


def build_tree(urls):
    # Sorting places every parent URL before its children.
    urls = sorted(urls)
    tree = {"url": urls[0]}
    for url in urls[1:]:
        add_url(tree, url)
    return tree


def read_urls(filename):
    urls = []
    with open(filename) as fd:
        for line in fd:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            url = urlparse(line)
            # Keep scheme, host, and path; drop any query string or fragment.
            urls.append("".join([url.scheme, "://", url.netloc, url.path]))
    return urls


if __name__ == "__main__":
    tree = build_tree(read_urls(sys.argv[1]))
    print("%r" % tree)
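The note above suggests building the nested-dict form first and then converting it. A minimal sketch of such a conversion might look like the following; the helper name `to_url_tree` is hypothetical, the "http://" scheme is assumed, and trailing slashes are not restored because the nested-dict form does not record them.

```python
def to_url_tree(name, children, prefix="http://"):
    """Turn one nested-dict entry into a {'url': ..., 'sub_dir': [...]} node."""
    url = prefix + name
    node = {"url": url}
    if children:
        # Recurse into each child, extending the URL prefix by one component.
        node["sub_dir"] = [
            to_url_tree(child, grandchildren, url + "/")
            for child, grandchildren in sorted(children.items())
        ]
    return node


# A small nested-dict tree in the format produced by the earlier answer.
nested = {
    "www.ex.com": {
        "Media_Center": {},
        "product_cat_1": {"item_1": {}, "item_2": {}},
    }
}
host, children = next(iter(nested.items()))
tree = to_url_tree(host, children)
```

Leaves (empty dicts) get no sub_dir key, which matches the shape of the example output in the question.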