Is there a way to combine HTML files and export them as an Excel file in Python?

I am pretty new to Python, so my question might sound silly. I have downloaded several 'Well Completion' files from this link: https://wwwapps.emnrd.nm.gov/OCD/OCDPermitting/Reporting/Activity/WeeklyActivity.aspx . Now I want to combine all of the files into one Excel sheet using Python and export it. So far I have been unsuccessful, and I am hoping to get an answer here. The problem is that the downloaded files open in Excel, but their contents are actually HTML.

The code that I have used to combine the files is:

import os
from bs4 import BeautifulSoup
output_doc = BeautifulSoup()
output_doc.append(output_doc.new_tag("html"))
output_doc.html.append(output_doc.new_tag("body"))
data_folder= r'C:\Users\dtsar\OneDrive\Desktop\another well completion'
for file in os.listdir(data_folder):
    if not file.lower().endswith('.html'):
        continue

    with open(file, 'r') as html_file:
        output_doc.body.extend(BeautifulSoup(html_file.read(), "html.parser").body)

print(output_doc.prettify())

But the output I got is: <html> <body> </body> </html>

I cannot understand where I am going wrong. The next step would be to export the data to Excel, but I cannot even combine the files in the first place. Any ideas?
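
An empty body with no error suggests that the loop never opened a file. The most likely cause is the `.endswith('.html')` filter: if the downloads carry an Excel-style extension such as `.xls` (an assumption, based on the fact that they open in Excel), every file is skipped. Separately, `open(file, 'r')` uses the bare file name, which only resolves when the script runs from inside the data folder; the folder path needs to be joined in. Below is a minimal sketch of a file finder that handles both points. The function name `find_html_exports` and the extension list are hypothetical choices for illustration.

```python
import os


def find_html_exports(folder, extensions=(".html", ".htm", ".xls")):
    """Return full paths of files in `folder` whose content looks like HTML."""
    matches = []
    for name in sorted(os.listdir(folder)):
        # str.endswith accepts a tuple, so several extensions can be checked at once.
        if not name.lower().endswith(extensions):
            continue
        # Join the folder and the file name; a bare name is relative to the
        # current working directory, not to `folder`.
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue
        # Peek at the start of the file to confirm it holds HTML markup.
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            head = handle.read(2048).lower()
        if "<html" in head or "<table" in head:
            matches.append(path)
    return matches
```

Each returned path can then be passed to `open()` directly, for example `for path in find_html_exports(data_folder): ...`.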

So, I figured out how to convert the broken Excel files (HTML saved with an .xls extension) into proper .xlsx files. The code is below in case anybody needs it:

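import os
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

folder_path = r'path to the folder'

for filename in os.listdir(folder_path):
    if not filename.lower().endswith(".xls"):
        continue
    file_path = os.path.join(folder_path, filename)
    # The ".xls" files are really HTML, so parse them as HTML.
    with open(file_path) as f:
        soup = BeautifulSoup(f, 'html.parser')
    tables = soup.find_all('table')
    if not tables:
        continue  # nothing to convert; a workbook needs at least one sheet
    # Swap only the extension, not any ".xls" elsewhere in the path.
    output_path = os.path.splitext(file_path)[0] + ".xlsx"
    # The context manager saves the workbook once, after all sheets are written.
    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        for i, table in enumerate(tables):
            caption = table.find('caption')
            name = caption.get_text().strip() if caption else ''
            # Excel limits sheet names to 31 characters.
            sheet_name = name[:31] or 'Sheet{}'.format(i + 1)
            # Wrap the markup in StringIO; newer pandas deprecates literal HTML strings.
            df = pd.read_html(StringIO(str(table)))[0]
            df.to_excel(writer, sheet_name=sheet_name, index=False)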
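
The code above converts each file separately. For the original goal of one combined sheet, here is a rough sketch using `pd.concat`. It assumes all tables share the same columns, that the files end in `.xls`, and that an HTML parser for `pd.read_html` plus `openpyxl` are installed. The function names and the `source_file` column are hypothetical, added only so each row can be traced back to its file.

```python
import os
from io import StringIO

import pandas as pd


def read_tables(path):
    """Parse every <table> in an HTML file into a list of DataFrames."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # pd.read_html raises ValueError if the file contains no tables.
        return pd.read_html(StringIO(handle.read()))


def combine_frames(frames_by_file):
    """Stack DataFrames from several files, tagging each row with its file name."""
    labelled = []
    for filename, frames in frames_by_file.items():
        for frame in frames:
            labelled.append(frame.assign(source_file=filename))
    if not labelled:
        return pd.DataFrame()
    # ignore_index=True renumbers the rows 0..n-1 across all files.
    return pd.concat(labelled, ignore_index=True)


def combine_folder(folder, output_path, extension=".xls"):
    """Read every matching file in `folder` and write one combined sheet."""
    frames_by_file = {}
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(extension):
            frames_by_file[name] = read_tables(os.path.join(folder, name))
    combined = combine_frames(frames_by_file)
    combined.to_excel(output_path, sheet_name="Combined", index=False)
    return combined
```

A call would look like `combine_folder(folder_path, "combined.xlsx")`. If the files contain layout tables besides the data table, the list returned by `read_tables` would need filtering first.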
