How can I make this Python script walk through a directory tree?

Question

I have a Python script:

$ cat ~/script.py
import sys
from lxml import etree
from lxml.html import parse
doc = parse(sys.argv[1])
title = doc.find('//title')
title.text = span2.text.strip()
print etree.tostring(doc)

I can run the script on an individual file by issuing something like:

$ python script.py foo.html > new-foo.html

My problem is that I have a directory ~/webpage that contains hundreds of .html files scattered throughout sub-directories. I would like to run ~/script.py on all of these HTML files. I am currently doing this with:

$ find ~/webpage/ -name "*.html" -exec sh -c 'python ~/script.py {} > {}-new' \;

However, this creates a new file for each HTML file in ~/webpage, and I actually want the original file edited.

Is this possible to do from within Python? Maybe with something like os.walk?

Answer 1

The os module in Python has a function specifically for walking down directory trees, os.walk. From its documentation:

Generate the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).
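
To make the 3-tuple concrete, here is a minimal sketch that builds a small throwaway tree and prints what os.walk yields for it. The helper list_tree and the file names (sub, index.html, page.html) are made up for illustration and do not come from the question; Python 3 is assumed.

```python
import os
import tempfile


def list_tree(top):
    """Return (relative dirpath, dirnames, filenames) for each directory under top."""
    rows = []
    for dirpath, dirnames, filenames in os.walk(top):
        # Sorting dirnames in place also fixes the order in which
        # os.walk descends into the subdirectories.
        dirnames.sort()
        rows.append((os.path.relpath(dirpath, top), list(dirnames), sorted(filenames)))
    return rows


# Hypothetical layout: top/index.html and top/sub/page.html
with tempfile.TemporaryDirectory() as top:
    os.mkdir(os.path.join(top, "sub"))
    for rel in ("index.html", os.path.join("sub", "page.html")):
        open(os.path.join(top, rel), "w").close()
    for row in list_tree(top):
        print(row)
```

Each row corresponds to one directory: the top directory comes first, followed by its subdirectories. Applied to the script from the question, the walk looks like this: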

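import os
from lxml import etree
from lxml.html import parse


def parse_file(file_name):
    doc = parse(file_name)
    title = doc.find('//title')
    # span2 is not defined in the question's snippet; assign it before this line
    title.text = span2.text.strip()
    # parenthesized print also runs under Python 3 (where tostring returns bytes)
    print(etree.tostring(doc))


for root, dirs, files in os.walk('/path/to/webpages'):
    for name in files:
        # only process the .html files the question is about
        if name.endswith('.html'):
            parse_file(os.path.join(root, name))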
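
The loop above still prints each document to standard output, while the question asks for the original files to be edited. One way to do that, sketched here with the standard library only, is to write the new content to a temporary file in the same directory and then swap it into place with os.replace. The names rewrite_in_place and transform are hypothetical, and transform stands in for whatever lxml processing the script performs; the sketch assumes Python 3 and UTF-8 encoded files.

```python
import os
import shutil
import tempfile


def rewrite_in_place(top, transform, suffix=".html", encoding="utf-8"):
    """Apply transform(text) -> text to every file under top whose name ends with suffix.

    Returns the list of paths that were rewritten.
    """
    changed = []
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding=encoding) as src:
                old = src.read()
            new = transform(old)
            if new == old:
                continue  # nothing changed, leave the file untouched
            # Create the temporary file next to the original so that
            # os.replace never has to cross a filesystem boundary.
            fd, tmp_path = tempfile.mkstemp(dir=dirpath, suffix=".tmp")
            try:
                with os.fdopen(fd, "w", encoding=encoding) as dst:
                    dst.write(new)
                shutil.copymode(path, tmp_path)  # keep the original permission bits
                os.replace(tmp_path, path)
            except BaseException:
                os.remove(tmp_path)
                raise
            changed.append(path)
    return changed
```

For the tree in the question, a call might look like rewrite_in_place(os.path.expanduser('~/webpage'), my_transform), where my_transform is your own function. Writing to a temporary file first means an error halfway through leaves the original file as it was.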

Answer 2

import os

def process(file_name):
    # example callback: report the size of each file
    with open(file_name) as readonly_file:
        print("Do something with %s, size %d" % (file_name, len(readonly_file.read())))

def traverse(directory, callback=process):
    # call callback with the absolute path of every file under directory
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            path = os.path.abspath(os.path.join(dirpath, f))
            callback(path)

# traverse returns None, so there is nothing to print
traverse('./')

Rewrite the process function to suit your own logic; the callback receives the file's absolute path as its only parameter.

If you want to process only files of a specific type:

def traverse(directory, callback=process, file_type=".txt"):
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            path = os.path.abspath(os.path.join(dirpath, f))
            # include the dot so that a name like "notestxt" does not match
            if path.endswith(file_type):
                callback(path)
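
A variation on the same idea, offered as a sketch rather than as part of the original answer: instead of taking a callback, a generator can yield the matching paths and leave the processing to the caller. The name iter_files, the default suffixes, and the case-insensitive matching are assumptions of this example. It relies on str.endswith accepting a tuple, so several extensions can be checked at once.

```python
import os


def iter_files(directory, suffixes=(".html", ".htm")):
    """Yield absolute paths of files under directory whose names end with one of suffixes."""
    # Lower-case both sides so that "INDEX.HTML" matches ".html".
    suffixes = tuple(s.lower() for s in suffixes)
    for dirpath, dirnames, filenames in os.walk(directory):
        for name in filenames:
            if name.lower().endswith(suffixes):
                yield os.path.abspath(os.path.join(dirpath, name))
```

The caller can then write a plain loop such as for path in iter_files('./'): process(path), which makes it easy to stop early or to collect the paths into a list.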

Answer 1
2 2016-01-12 03:45:46

Answer 2
2 accepted 2016-01-12 03:56:22
