read xls file in pandas / python: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'
I am trying to open an xls file (with only one tab) into a pandas DataFrame.
It is a file that I can normally read in Excel or Excel for the web; here is the raw file itself: https://www.dropbox.com/scl/fi/zbxg8ymjp8zxo6k4an4dj/product-screener.xls?dl=0&rlkey=3aw7whab78jeexbdkthkjzkmu
I notice that the top two rows have merged cells, and so do some of the columns.
I have tried several methods (from Stack Overflow), which all fail.
# method 1 - read excel
import pandas as pd
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file)
print(df)
error: Excel file format cannot be determined, you must specify an engine manually.
# method 2 - pip install xlrd and use engine
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file, engine='xlrd')
print(df)
error: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'
# method 3 - rename to xlsx and open with openpyxl
file = "C:\\Users\\admin\\Downloads\\product-screener.xlsx"
df = pd.read_excel(file, engine='openpyxl')
print(df)
error: File is not a zip file
(Possibly converting, as opposed to renaming, is an option.)
# method 4 - use read_xml
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_xml(file)
print(df)
This method actually yields a result, but the DataFrame it produces bears no relation to the sheet. Presumably one needs to interpret the XML (which seems complex)?
Style Name Table
0 NaN None NaN
1 NaN All funds NaN
# method 5 - use read_table
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_table(file)
print(df)
This method reads the file into a one-column DataFrame, one row per line of XML. So how could one use this to create a standard 2D DataFrame in the same shape as the xls file?
0 <Workbook xmlns="urn:schemas-microsoft-com:off...
1 <Styles>
2 <Style ss:ID="Default">
3 <Alignment Horizontal="Left"/>
4 </Style>
... ...
226532 </Cell>
226533 </Row>
226534 </Table>
226535 </Worksheet>
226536 </Workbook>
# method 6 - use read_html
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_html(file)
print(df)
This returns an empty list [], whereas one might have expected at least a list of DataFrames.
So the question is: what is the easiest way to read this file into a DataFrame (or a similar usable format)?
Not a complete solution, but it should get you started. The "xls" file is actually a plain XML file in the SpreadsheetML format. Change the file extension to .xml and view it in your internet browser; the structure (at least of the given file) is rather straightforward.
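The xlrd error in the question already hints at this: the file begins with b'\xef\xbb\xbf<?xml', which is a UTF-8 byte order mark followed by an XML declaration. A minimal sketch of checking the leading bytes before choosing a reader might look like the following. The function name sniff_spreadsheet_format and its return labels are made up for illustration; they are not part of pandas or xlrd.

```python
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")  # legacy binary .xls container
ZIP_MAGIC = b"PK\x03\x04"                        # .xlsx files are zip archives
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_spreadsheet_format(head: bytes) -> str:
    """Guess the real format from the first bytes of a file (hypothetical helper)."""
    if head.startswith(OLE2_MAGIC):
        return "xls (OLE2 binary)"
    if head.startswith(ZIP_MAGIC):
        return "xlsx (zip)"
    # Drop an optional UTF-8 byte order mark and leading whitespace
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    body = head.lstrip()
    if body.startswith(b"<?xml") or body.startswith(b"<Workbook"):
        return "xml"
    return "unknown"


def sniff_file(path):
    """Read a small prefix of the file at `path` and classify it."""
    with open(path, "rb") as fh:
        return sniff_spreadsheet_format(fh.read(64))
```

For the file in the question, sniff_file("product-screener.xls") would be expected to report "xml", which explains why neither xlrd nor openpyxl can open it.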
The following reads the data contents into a pandas DataFrame:
import pandas as pd
import xml.etree.ElementTree as ET
tree = ET.parse('product-screener.xls')
root = tree.getroot()
# in this file root[1] is the Worksheet and root[1][0] its Table; rows 0 and 1 hold the headers
data = [[c[0].text for c in r] for r in root[1][0][2:]]
# take each column's type from the ss:Type attribute of the first data row
types = [c[0].get('{urn:schemas-microsoft-com:office:spreadsheet}Type') for c in root[1][0][2]]
df = pd.DataFrame(data)
df = df.replace('-', None)  # '-' marks a missing value
for c in df.columns:
    if types[c] == 'Number':
        df[c] = pd.to_numeric(df[c])
    elif types[c] == 'DateTime':
        df[c] = pd.to_datetime(df[c])
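Positional indexing such as root[1][0] depends on the exact element order in this particular file. A hedged alternative is to look elements up by their namespaced tag names. The sketch below assumes the usual SpreadsheetML nesting (Workbook > Worksheet > Table > Row > Cell > Data) and runs on a small made-up sample rather than the real file; read_rows and SAMPLE are illustrative names.

```python
import io
import xml.etree.ElementTree as ET

NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}
SS = "{urn:schemas-microsoft-com:office:spreadsheet}"


def read_rows(source):
    """Return the first worksheet's rows as lists of (text, type) tuples."""
    root = ET.parse(source).getroot()
    table = root.find("ss:Worksheet/ss:Table", NS)
    rows = []
    for row in table.findall("ss:Row", NS):
        cells = []
        for cell in row.findall("ss:Cell", NS):
            data = cell.find("ss:Data", NS)
            if data is None:
                # An empty cell has no Data child
                cells.append((None, None))
            else:
                cells.append((data.text, data.get(SS + "Type")))
        rows.append(cells)
    return rows


# Made-up sample in the same dialect, for demonstration only
SAMPLE = b"""<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Styles><Style ss:ID="Default"/></Styles>
 <Worksheet ss:Name="Sheet1">
  <Table>
   <Row><Cell><Data ss:Type="String">Fund</Data></Cell><Cell><Data ss:Type="String">Return</Data></Cell></Row>
   <Row><Cell><Data ss:Type="String">A</Data></Cell><Cell><Data ss:Type="Number">1.5</Data></Cell></Row>
   <Row><Cell><Data ss:Type="String">B</Data></Cell><Cell/></Row>
  </Table>
 </Worksheet>
</Workbook>"""

rows = read_rows(io.BytesIO(SAMPLE))
```

Unlike c[0].text, this version does not raise on a cell without a Data child. It does not account for ss:Index gaps between cells, so rows with skipped cells would still need extra handling.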
Getting the column names from rows 0 and 1 is a bit more involved because of the merged cells; I leave it as an exercise for the reader.
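One possible way to approach that exercise is sketched here. SpreadsheetML marks merged cells with the ss:MergeAcross and ss:MergeDown attributes and uses a 1-based ss:Index to skip positions, so the header cells can be laid out on a grid and each merged value repeated over its span. I have not confirmed that the file in the question uses exactly these attributes, so treat this as an assumption; expand_header_rows, flatten_headers and the sample are illustrative only.

```python
import xml.etree.ElementTree as ET

NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}
SS = "{urn:schemas-microsoft-com:office:spreadsheet}"


def expand_header_rows(rows):
    """Lay header cells out on a grid, repeating merged cells over their span."""
    grid = {}  # (row, col) -> text, both 0-based
    for r, row in enumerate(rows):
        col = 0
        for cell in row.findall("ss:Cell", NS):
            index = cell.get(SS + "Index")
            if index is not None:
                col = int(index) - 1  # ss:Index is 1-based
            # Skip positions already covered by a cell merged down from above
            while (r, col) in grid:
                col += 1
            data = cell.find("ss:Data", NS)
            text = data.text if data is not None else None
            across = int(cell.get(SS + "MergeAcross", 0))
            down = int(cell.get(SS + "MergeDown", 0))
            for dr in range(down + 1):
                for dc in range(across + 1):
                    grid[(r + dr, col + dc)] = text
            col += across + 1
    if not grid:
        return []
    width = max(c for _, c in grid) + 1
    return [[grid.get((r, c)) for c in range(width)] for r in range(len(rows))]


def flatten_headers(top, bottom):
    """Join two header rows into one name per column."""
    names = []
    for t, b in zip(top, bottom):
        if b is None or b == t:
            names.append(t)
        elif t is None:
            names.append(b)
        else:
            names.append(f"{t}: {b}")
    return names


# Made-up sample: "Name" spans two rows, "Discrete" spans two columns
SAMPLE = """<Table xmlns="urn:schemas-microsoft-com:office:spreadsheet"
       xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
 <Row>
  <Cell ss:MergeDown="1"><Data ss:Type="String">Name</Data></Cell>
  <Cell ss:MergeAcross="1"><Data ss:Type="String">Discrete</Data></Cell>
 </Row>
 <Row>
  <Cell ss:Index="2"><Data ss:Type="String">1y</Data></Cell>
  <Cell><Data ss:Type="String">3y</Data></Cell>
 </Row>
</Table>"""

table = ET.fromstring(SAMPLE)
top, bottom = expand_header_rows(table.findall("ss:Row", NS))
columns = flatten_headers(top, bottom)
```

With the tree parsed as above, the analogous call would be expand_header_rows(root[1][0][:2]), and the flattened list could then be assigned to df.columns if its length matches the number of data columns.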
I am posting the full solution here, which contains the accepted solution above (by @Stef) plus the final step of adding the headers to the DataFrame.
'''
get xls file
convert to xml
parse into dataframe
add headers
'''
import pandas as pd
import xml.etree.ElementTree as ET
import shutil
file_xls = "C:\\Users\\admin\\Downloads\\product-screener.xls"
file_xml = 'C:\\Users\\admin\\Downloads\\product-screener.xml'
shutil.copyfile(file_xls, file_xml)
tree = ET.parse(file_xml)
root = tree.getroot()
data = [[c[0].text for c in r] for r in root[1][0][2:]]
types = [c[0].get('{urn:schemas-microsoft-com:office:spreadsheet}Type') for c in root[1][0][2]]
df = pd.DataFrame(data)
df = df.replace('-', None)
for c in df.columns:
    if types[c] == 'Number':
        df[c] = pd.to_numeric(df[c])
    elif types[c] == 'DateTime':
        df[c] = pd.to_datetime(df[c])
print(df)
headers = [[c[0].text for c in r] for r in root[1][0][:2]]
# print(headers[0])
# print(len(headers[0]))
# print()
# print(headers[1])
# print(len(headers[1]))
# print()
# the names up to column AF (the first 32) come from headers[0]
df_headers = headers[0][0:32]
# the next 9 are discrete
x_list = ['discrete: ' + s for s in headers[1][0:9] ]
df_headers = df_headers + x_list
# the next 10 are annualised
x_list = ['annualised: ' + s for s in headers[1][9:19] ]
df_headers = df_headers + x_list
# the next 10 are cumulative
x_list = ['cumulative: ' + s for s in headers[1][19:29] ]
df_headers = df_headers + x_list
# the next 9 are calendar
x_list = ['calendar: ' + s for s in headers[1][29:38] ]
df_headers = df_headers + x_list
# the next 5 are portfolio characteristics (metrics)
x_list = ['metrics: ' + s for s in headers[1][38:43] ]
df_headers = df_headers + x_list
# the next 6 are portfolio characteristics
x_list = ['characteristics: ' + s for s in headers[1][43:49] ]
df_headers = df_headers + x_list
# the final 5 are sustainability characteristics
x_list = ['sustain: ' + s for s in headers[1][49:54] ]
df_headers = df_headers + x_list
print(df_headers)
# add headers to dataframe
df.columns = df_headers
print(df)
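The repeated slice-and-prefix blocks above could also be driven by a small table of (prefix, width) pairs. This is only a sketch of that refactoring: the widths are the ones used in the code above, while build_headers, GROUPS and the placeholder header lists are names and data introduced here for illustration.

```python
def build_headers(top, sub, n_plain, groups):
    """Take n_plain names from `top`, then prefix consecutive slices of `sub`."""
    names = list(top[:n_plain])
    start = 0
    for prefix, width in groups:
        names += [f"{prefix}: {s}" for s in sub[start:start + width]]
        start += width
    return names


# Group prefixes and widths copied from the solution above (they sum to 54)
GROUPS = [
    ("discrete", 9),
    ("annualised", 10),
    ("cumulative", 10),
    ("calendar", 9),
    ("metrics", 5),
    ("characteristics", 6),
    ("sustain", 5),
]

# Placeholder header rows standing in for headers[0] and headers[1]
top_row = [f"col{i}" for i in range(39)]
sub_row = [f"s{i}" for i in range(54)]
names = build_headers(top_row, sub_row, 32, GROUPS)
```

With the real data this would be called as build_headers(headers[0], headers[1], 32, GROUPS) and the result assigned to df.columns.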