How do I make one CSV file out of multiple HTML files in a directory?

Question

I am currently trying to do the following:

Identify all the files that contain the text "business class".
Remove the files that do not contain the text "business class" from the directory (this step is not important).
Write all the files that contain the text "business class" to a CSV file with the columns "filename" and "text".

This is what I have so far:

import os
import fnmatch
from pathlib import Path
from bs4 import BeautifulSoup
import csv

directory = "/directory"

remove_files = []

for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)) as f:
            html = f.read()
if 'business class' in html:
    lines = [[files, html]]
    header = ['filename', 'text']
    with open("test.csv", "w", newline='') as f:
        writer = csv.writer(f, delimiter=',')
        writer.writerow(header)
        for l in lines:
            writer.writerow(l)
else:
    remove_files.append(os.path.join(dirpath, filename))

for each in remove_files:
    os.remove(each)
    print ('REMOVED: %s' %each)

The problem is that the code currently only loops through one folder in the directory, and it writes all four filenames into the "filename" column and all the texts into the "text" column. So I end up with a single row containing four files instead of one row per file.

So the CSV file should look something like:

filename,text
filename001.html,this text contains the phrase business class
filename002.html,this text is about business class
filename003.html,this text is about business classes and economy classes

Answer 1

Use lxml to parse the HTML files, then use the csv module to create the CSV file.

lxml: https://lxml.de/parsing.html#parsing-html

csv: https://docs.python.org/3/library/csv.html
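
A minimal sketch of the parse-then-write step, assuming the HTML of each file has already been read into memory. It uses the standard library's html.parser in place of lxml so that it runs without extra installs; with lxml installed, lxml.html.fromstring(html).text_content() would do a similar job. The names TextExtractor, html_to_text, and write_rows are illustrative and do not come from the original post.

```python
import csv
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags
        self.parts.append(data)


def html_to_text(html):
    """Return the text nodes of `html`, joined by spaces, whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def write_rows(csv_path, pages):
    """Write a header plus one CSV row per (filename, html) pair in `pages`."""
    # Open the output once, outside the loop, so rows accumulate
    # instead of the file being overwritten on every iteration.
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "text"])
        for filename, html in pages:
            writer.writerow([filename, html_to_text(html)])
```

The point that matters for the question is that the CSV file is opened a single time and each file contributes its own row, keyed by its own filename rather than by the whole list of files.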

If you are having trouble finding the HTML files, use os.walk:

https://docs.python.org/3/library/os.html#os.walk
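
As a rough illustration of the os.walk step, the sketch below visits every subfolder and tests each HTML file for the phrase inside the inner loop, so no file is skipped. The function name find_matching_html, the phrase parameter, and the UTF-8 encoding with errors="replace" are assumptions made for the example, not part of the original post.

```python
import fnmatch
import os


def find_matching_html(directory, phrase="business class"):
    """Yield (path, html) for each *.html file under `directory` containing `phrase`."""
    for dirpath, _dirs, files in os.walk(directory):
        for filename in fnmatch.filter(files, "*.html"):
            path = os.path.join(dirpath, filename)
            # Assumption: files are UTF-8; undecodable bytes are replaced
            with open(path, encoding="utf-8", errors="replace") as f:
                html = f.read()
            # The check sits inside the inner loop, so it runs once per file
            if phrase in html:
                yield path, html
```

Each yielded pair can then become one CSV row, for example with os.path.basename(path) in the "filename" column.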

1 answers

solution1
0 2019-09-19 17:36:25
