
How do I make one CSV file out of multiple HTML files in a directory?

I am currently trying to do the following:

  1. Identify all the files that contain the text "business class".

  2. Remove the files that do not contain the text "business class" from the directory (this step is not important).

  3. Write every file that contains the text "business class" to a CSV with the columns "filename" and "text", one row per file.

This is what I have so far:

import os
import fnmatch
from pathlib import Path
from bs4 import BeautifulSoup
import csv

directory = "/directory"

remove_files = []

for dirpath, dirs, files in os.walk(directory):
    for filename in fnmatch.filter(files, '*.html'):
        with open(os.path.join(dirpath, filename)) as f:
            html = f.read()
if 'business class' in html:
    lines = [[files, html]]
    header = ['filename', 'text']
    with open("test.csv", "w", newline='') as f:
        writer = csv.writer(f, delimiter=',')
        writer.writerow(header)
        for l in lines:
            writer.writerow(l)
else:
    remove_files.append(os.path.join(dirpath, filename))

for each in remove_files:
    os.remove(each)
    print ('REMOVED: %s' %each)

The problem I'm encountering is that the code currently loops through only one folder in the directory, and it writes all four filenames into the "filename" column and all the texts into the "text" column. So I end up with a single row containing four files.

The CSV file should look something like this:

filename,text
filename001.html,this text contains the phrase business class
filename002.html,this text is about business class
filename003.html,this text is about business classes and economy classes

Use lxml to parse the HTML files, then use the csv module to create the CSV file.

LXML https://lxml.de/parsing.html#parsing-html

CSV https://docs.python.org/3/library/csv.html
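
A minimal sketch of the CSV side, assuming the (filename, text) pairs have already been collected. The `write_rows` helper and the sample `rows` are made up for illustration. The output is opened once, the header is written once, and each file gets its own `writerow` call with a single filename rather than the whole `files` list, which is what avoids the single combined row described in the question.

```python
import csv
import io


def write_rows(out, rows):
    """Write a header and then one CSV row per (filename, text) pair."""
    writer = csv.writer(out)
    writer.writerow(["filename", "text"])
    for filename, text in rows:
        writer.writerow([filename, text])


# Hypothetical sample data standing in for the parsed HTML files.
rows = [
    ("filename001.html", "this text contains the phrase business class"),
    ("filename002.html", "this text is about business class, mostly"),
]

buffer = io.StringIO()  # stands in for open("test.csv", "w", newline="")
write_rows(buffer, rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Text that contains a comma is quoted by the csv module automatically, so the page text can be passed through as-is.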

If you are having trouble finding the HTML files, use os.walk:

https://docs.python.org/3/library/os.html#os.walk
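
Putting the pieces together, here is one possible sketch rather than a definitive implementation. To keep it runnable without third-party packages it swaps in the standard library's `html.parser` for lxml (or BeautifulSoup); the text extraction step can be replaced with either of those. The names `extract_text`, `find_matches`, and `write_csv` are hypothetical, and the optional file-removal step from the question is left out.

```python
import csv
import fnmatch
import os
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document (stdlib stand-in for lxml)."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html):
    """Return the document's text with whitespace collapsed to single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def find_matches(directory, phrase="business class"):
    """Yield (filename, text) for each *.html file under directory containing phrase."""
    for dirpath, _dirs, files in os.walk(directory):
        for filename in sorted(fnmatch.filter(files, "*.html")):
            path = os.path.join(dirpath, filename)
            with open(path, encoding="utf-8", errors="replace") as f:
                text = extract_text(f.read())
            if phrase in text:  # case-sensitive, as in the question
                yield filename, text


def write_csv(directory, csv_path, phrase="business class"):
    """Write one row per matching file; the CSV is opened only once."""
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "text"])
        writer.writerows(find_matches(directory, phrase))
```

Calling `write_csv("/directory", "test.csv")` would then walk every subfolder and produce one row per matching file. Note that this simple extractor also keeps the contents of script and style tags, and it joins text nodes with a space, so a proper parser is the better choice for messy pages.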
