I am pretty new to Python, so my question might sound silly. I have downloaded several 'Well Completion' files from this link: https://wwwapps.emnrd.nm.gov/OCD/OCDPermitting/Reporting/Activity/WeeklyActivity.aspx . Now I want to combine all of the files into one Excel sheet using Python and export it. So far I have been unsuccessful, and I am hoping to get an answer here. The problem is that the downloaded files open in Excel but are actually in HTML format.
The code that I have used to combine the files is:
import os
from bs4 import BeautifulSoup

output_doc = BeautifulSoup()
output_doc.append(output_doc.new_tag("html"))
output_doc.html.append(output_doc.new_tag("body"))
data_folder = r'C:\Users\dtsar\OneDrive\Desktop\another well completion'

for file in os.listdir(data_folder):
    if not file.lower().endswith('.html'):
        continue
    with open(file, 'r') as html_file:
        output_doc.body.extend(BeautifulSoup(html_file.read(), "html.parser").body)

print(output_doc.prettify())
but the response I got is:
<html>
<body>
</body>
</html>
I cannot understand where I am going wrong. The next step would be to export the data to Excel format, but I cannot seem to combine the files in the first place. Any ideas?
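A likely cause of the empty output, judging from the code above: the downloaded files seem to keep an .xls extension, so the `.endswith('.html')` check skips every one of them, and `open(file, ...)` is given a bare file name rather than a path inside `data_folder`. Below is a minimal sketch of a file-listing helper that avoids both problems. The function name `find_report_files` and the extension tuple are assumptions made for illustration, not part of the original code.

```python
import os


def find_report_files(data_folder, extensions=(".xls", ".html")):
    """Return full paths of files in data_folder with a matching extension.

    `extensions` is an assumed default; adjust it to whatever the
    downloaded files are actually called.
    """
    paths = []
    for name in sorted(os.listdir(data_folder)):
        # str.endswith accepts a tuple, so several extensions can be checked at once.
        if name.lower().endswith(extensions):
            # os.listdir returns bare names; join them with the folder
            # so open() works regardless of the current working directory.
            paths.append(os.path.join(data_folder, name))
    return paths
```

Printing `find_report_files(data_folder)` before parsing anything is a quick way to confirm that the loop has files to work on.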
So, I figured out how to convert the broken Excel files into proper .xlsx format. The code is below in case anybody needs it:
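import os
from io import StringIO

import pandas as pd
from bs4 import BeautifulSoup

folder_path = r'path to the folder'

for filename in os.listdir(folder_path):
    if filename.endswith(".xls"):
        file_path = os.path.join(folder_path, filename)
        # The downloads are HTML documents saved with an .xls extension.
        with open(file_path) as f:
            soup = BeautifulSoup(f, 'html.parser')
        tables = soup.find_all('table')
        if not tables:
            continue  # nothing to write for this file
        xlsx_path = os.path.splitext(file_path)[0] + ".xlsx"
        # The context manager saves the workbook on exit;
        # writer.save() was removed in pandas 2.0.
        with pd.ExcelWriter(xlsx_path, engine='openpyxl') as writer:
            for i, table in enumerate(tables):
                caption = table.find('caption')
                if caption:
                    # Excel limits sheet names to 31 characters.
                    sheet_name = caption.get_text().strip()[:31]
                else:
                    sheet_name = 'Sheet{}'.format(i + 1)
                # Recent pandas versions expect a file-like object here
                # rather than a literal HTML string.
                df = pd.read_html(StringIO(str(table)))[0]
                df.to_excel(writer, sheet_name=sheet_name, index=False)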
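The conversion above writes one workbook per file, while the original goal was a single sheet holding every file's data. Here is a rough sketch of that last step, assuming all tables share the same columns. The helper names (`read_tables`, `combine_frames`, `combine_folder`), the `source_file` column, and the UTF-8 encoding are my assumptions, not something confirmed by the downloaded files.

```python
import os
from io import StringIO

import pandas as pd


def read_tables(file_path):
    """Parse every <table> in an HTML file that was saved as .xls."""
    # Encoding is an assumption; undecodable bytes are replaced, not fatal.
    with open(file_path, encoding="utf-8", errors="replace") as f:
        # pd.read_html raises ValueError if the file contains no tables.
        return pd.read_html(StringIO(f.read()))


def combine_frames(frames_by_file):
    """Stack DataFrames vertically, recording which file each row came from."""
    labelled = [
        df.assign(source_file=name)
        for name, frames in frames_by_file.items()
        for df in frames
    ]
    # pd.concat raises ValueError when given an empty list.
    return pd.concat(labelled, ignore_index=True)


def combine_folder(folder_path, output_path):
    """Read all .xls (really HTML) files in a folder and write one sheet."""
    frames_by_file = {}
    for filename in sorted(os.listdir(folder_path)):
        if filename.lower().endswith(".xls"):
            full_path = os.path.join(folder_path, filename)
            frames_by_file[filename] = read_tables(full_path)
    combined = combine_frames(frames_by_file)
    # Writing .xlsx needs openpyxl installed.
    combined.to_excel(output_path, index=False)
    return combined
```

A call such as `combine_folder(folder_path, "combined.xlsx")` would then produce the single sheet. If the tables do not share the same columns, `pd.concat` still runs but fills the missing cells with NaN, so it is worth checking the result.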