
How do I read all html-files in a directory recursively?

I am trying to get the Doctype of every html-file printed to a txt-file. I have no experience in Python, so bear with me a bit. :)

The final script is supposed to delete elements from each html-file depending on the html-version given in its Doctype. I've attempted to list the files in PHP as well, and it works to some extent, but I think Python is a better choice for this task.

The script below is what I've got right now, but I can't figure out how to write a "for each" that gets the Doctype of every html-file in the arkivet folder recursively. It currently only prints the filename and extension, and I don't know how to get the path to each file, or how to use BeautifulSoup to edit the files and get info out of them.

import fnmatch
from urllib.request import urlopen as uReq
import os
from bs4 import BeautifulSoup as soup
from bs4 import Doctype

files = ['*.html']
matches = []

for root, dirnames, filenames in os.walk("arkivet"):
    for extensions in files:
        for filename in fnmatch.filter(filenames, extensions):
            matches.append(os.path.join(root, filename))
            print(filename)

matches is a list, but I am not sure how to handle it properly in Python. I would like to print the folder names, the filenames with extension, and each file's doctype into a text-file in root.

The script runs in the CLI on a local Vagrant Debian server with Python 3.5 (Python 2.x is present too). All files and folders exist in a folder called arkivet (archive) under the server's public root.

Any help appreciated! I'm stuck here :)

Vikas's answer is probably what you are asking for, but in case he interpreted the question incorrectly, it's worth knowing that you have access to all three of those variables as you loop: root, dirnames, and filenames. You are currently printing just the base file name:

print(filename)

It is also possible to print the full path instead:

print(os.path.join(root, filename))

Vikas solved the lack of a directory name by using a different function (os.listdir), but I think that approach loses the ability to recurse into subdirectories.

A combination of os.walk as you posted, and reading the interior of the file with open as Vikas posted is perhaps what you are going for?

As you did not mark any of the answers as solutions, I'm guessing you never quite got your answer. Here's a chunk of code that recursively searches for files, prints the full file path, and shows the Doctype string of each html file if it exists.

import os
from bs4 import BeautifulSoup, Doctype

directory = '/home/brian/Code/sof'
for root, dirnames, filenames in os.walk(directory):
    for filename in filenames:
        if filename.endswith('.html'):
            fname = os.path.join(root, filename)
            print('Filename: {}'.format(fname))
            with open(fname) as handle:
                soup = BeautifulSoup(handle.read(), 'html.parser')
                for item in soup.contents:
                    if isinstance(item, Doctype):
                        print('Doctype: {}'.format(item))
                        break
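
Since the question also asks for the results to end up in a text-file in root rather than on the terminal, the same loop can write a report instead of printing. This is only a sketch: the folder name arkivet comes from the question, while the report name doctypes.txt and the helper name collect_doctypes are my own examples.

```python
import os
from bs4 import BeautifulSoup, Doctype

def collect_doctypes(archive_dir, report_path):
    """Walk archive_dir recursively and write one tab-separated line per
    .html file: its folder, its filename, and its doctype (if any)."""
    with open(report_path, 'w') as report:
        for root, dirnames, filenames in os.walk(archive_dir):
            for filename in filenames:
                if not filename.endswith('.html'):
                    continue
                with open(os.path.join(root, filename)) as handle:
                    parsed = BeautifulSoup(handle.read(), 'html.parser')
                # take the first Doctype node at the top level, if present
                doctype = next((str(item) for item in parsed.contents
                                if isinstance(item, Doctype)),
                               'no doctype found')
                report.write('{}\t{}\t{}\n'.format(root, filename, doctype))

collect_doctypes('arkivet', 'doctypes.txt')
```

From there the report can be opened in any editor, or read back in Python to decide which elements to delete per html-version.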

If you want to read all the html files in a particular directory, you can try this:

import os
from bs4 import BeautifulSoup

directory = '/Users/xxxxx/Documents/sample/'
for filename in os.listdir(directory):
    if filename.endswith('.html'):
        fname = os.path.join(directory, filename)
        with open(fname, 'r') as f:
            soup = BeautifulSoup(f.read(), 'html.parser')
            # parse the html as you wish
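
Note that os.listdir only looks at one directory, not subdirectories. Since the question mentions Python 3.5, the recursive search can also be written with pathlib's rglob instead of os.walk plus fnmatch; a sketch, with the arkivet folder name taken from the question:

```python
from pathlib import Path

archive = Path('arkivet')  # folder name from the question
if archive.is_dir():
    # rglob('*.html') matches .html files in the folder and all subfolders
    for path in sorted(archive.rglob('*.html')):
        print(path)
```

Each path is a Path object, so it can be passed straight to open() the same way as the os.path.join results above.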
