How do I recursively read all html files in a directory?

Question

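I am trying to print the Doctype of every html file to a txt file. I have no experience with Python, so please bear with me. :)

The final script should remove elements from the html files depending on the HTML version given in the Doctype set in each file. I also tried listing the files in PHP, and that worked to some extent. I think Python is a better choice for this task.

The script below is what I have so far, but I can't figure out how to write a "for each" that recursively gets the Doctype of every html file in the arkivet folder. At the moment I only print the file name and extension; I don't know how to get its path, or how to use BeautifulSoup to edit the files and extract information from them.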

import fnmatch
from urllib.request import urlopen as uReq
import os
from bs4 import BeautifulSoup as soup
from bs4 import Doctype

files = ['*.html']
matches = []

for root, dirnames, filenames in os.walk("arkivet"):
    for extensions in files:
        for filename in fnmatch.filter(filenames, extensions):
            matches.append(os.path.join(root, filename))
            print(filename)

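matches is an array (a list, in Python terms), but I'm not sure how to handle it properly in Python. I would like to print each file's name with its extension, its file name, and its Doctype to a text file in the root.

The script runs from the CLI on a local Vagrant Debian server with Python 3.5 (Python 2.x is also installed). All files and folders live in a folder named arkivet (archive) under the server's public root.

Any help is appreciated! I'm stuck here :)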

Answer 1

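Vikas's answer is probably what you are asking for, but in case he misread the question, it is worth knowing that inside the loop you have access to all three variables: root, dirnames, and filenames. You are currently printing only the base file name: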

print(filename)

You can also print the full path:

print(os.path.join(root, filename))

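Vikas worked around the missing directory name by using a different function (os.listdir), but I think that loses the ability to recurse: os.listdir only returns the entries directly inside the directory you give it.

Perhaps a combination of the os.walk loop you posted and the file opening and reading from Vikas's answer is what you want?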
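To make the difference concrete, here is a minimal sketch that puts the two approaches side by side. The function names html_files_flat and html_files_recursive are made up for this illustration; only os.listdir, os.walk and os.path.join come from the standard library.

```python
import os


def html_files_flat(directory):
    """HTML files directly inside directory (os.listdir does not descend)."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(".html")
    )


def html_files_recursive(directory):
    """HTML files in directory and in every subdirectory below it."""
    found = []
    # os.walk yields one (root, dirnames, filenames) tuple per directory visited
    for root, dirnames, filenames in os.walk(directory):
        for name in filenames:
            if name.endswith(".html"):
                found.append(os.path.join(root, name))
    return sorted(found)
```

Called on the same folder, the first function should miss anything stored in a subfolder, while the second one should pick it up.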

Answer 2

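Since you haven't marked any answer as the solution, I'm guessing you never got the answer you needed. Here is a chunk of code that recursively searches for files, prints the full file path, and shows the Doctype string of each html file, if one exists.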

import os
from bs4 import BeautifulSoup, Doctype

directory = '/home/brian/Code/sof'
for root, dirnames, filenames in os.walk(directory):
    for filename in filenames:
        if filename.endswith('.html'):
            fname = os.path.join(root, filename)
            print('Filename: {}'.format(fname))
            with open(fname) as handle:
                soup = BeautifulSoup(handle.read(), 'html.parser')
                for item in soup.contents:
                    if isinstance(item, Doctype):
                        print('Doctype: {}'.format(item))
                        break
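The question also asks for the results to end up in a text file rather than on the console. The following is a rough sketch of one way to do that, not a drop-in replacement for the code above: it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages. The names DoctypeFinder, find_doctype and write_doctype_report, the tab-separated layout (file name, full path, Doctype) and the "(none)" placeholder are all assumptions made for this example.

```python
import os
from html.parser import HTMLParser


class DoctypeFinder(HTMLParser):
    """Remembers the first <!DOCTYPE ...> declaration it sees."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", e.g. "DOCTYPE html"
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def find_doctype(html_text):
    """Return the doctype declaration in html_text, or None if there is none."""
    finder = DoctypeFinder()
    finder.feed(html_text)
    finder.close()
    return finder.doctype


def write_doctype_report(directory, report_path):
    """Write one line per html file: file name, full path, doctype."""
    with open(report_path, "w", encoding="utf-8") as report:
        for root, dirnames, filenames in os.walk(directory):
            for filename in sorted(filenames):
                if not filename.endswith(".html"):
                    continue
                full_path = os.path.join(root, filename)
                # errors="replace" keeps going if a file is not valid UTF-8
                with open(full_path, encoding="utf-8", errors="replace") as handle:
                    doctype = find_doctype(handle.read())
                report.write("{}\t{}\t{}\n".format(
                    filename, full_path, doctype or "(none)"))
```

With the folder from the question this could be called as write_doctype_report("arkivet", "doctypes.txt"), where doctypes.txt is a hypothetical output name.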

Answer 3

If you want to read all the html files in a specific directory (without descending into its subdirectories), you can try this:

import os
from bs4 import BeautifulSoup

directory = '/Users/xxxxx/Documents/sample/'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            # parse the html as you wish
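If subdirectories need to be covered as well, one possible variation is to let pathlib do the recursion. This is only a sketch: iter_html_files is a name invented here, while Path.rglob and Path.is_file are standard library calls.

```python
from pathlib import Path


def iter_html_files(directory):
    """Yield every *.html file under directory, at any depth."""
    for path in sorted(Path(directory).rglob("*.html")):
        # skip anything that merely has an .html name, such as a directory
        if path.is_file():
            yield path
```

Each yielded path can then be read with path.read_text() and handed to BeautifulSoup in the same way as in the loop above.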

3 answers

Answer 1: score 2, posted 2018-01-23 12:56:13

Answer 2: score 1, accepted, posted 2018-01-25 13:04:29

Answer 3: score 0, posted 2018-01-23 12:49:58
