
Python read SAS generated XML type .xls file

I am trying to extract tabs (worksheets) from hundreds of SAS-generated .xls files. I tried the following approach without luck. My version of xlrd is 0.9.2.

import xlrd 
book = xlrd.open_workbook('out_1.xls')

The error message is:

Traceback (most recent call last):
  File "I:\Dropbox\Sas data\sacwin\test.py", line 3, in <module>
    book = xlrd.open_workbook('out_1.xls') # Open an .xls file
  File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 435, in open_workbook
    ragged_rows=ragged_rows,
  File "C:\Python27\lib\site-packages\xlrd\book.py", line 91, in open_workbook_xls
    biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
  File "C:\Python27\lib\site-packages\xlrd\book.py", line 1258, in getbof
    bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
  File "C:\Python27\lib\site-packages\xlrd\book.py", line 1252, in bof_error
    raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'

When I open the .xls file in a text editor, the header looks like this:

<?xml version="1.0" encoding="windows-1252"?>

<?mso-application progid="Excel.Sheet"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:x="urn:schemas-microsoft-com:office:excel"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:html="http://www.w3.org/TR/REC-html40">
<DocumentProperties xmlns="urn:schemas-microsoft-com:office">

Could you give me some suggestions on how to parse these files? Thanks!

I'm looking for a solution to this problem as well. I can tell you that the file format is XML, but it pre-dates the Excel 2007 'Office Open XML (ECMA-376)' format (I think it's SpreadsheetML), so it's not supported by xlrd.
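With hundreds of files it can help to check up front which ones are really XML rather than binary .xls. The following is a minimal sketch that classifies a file by its first bytes; the function names and the returned labels are made up for illustration, and it assumes the XML files start with an `<?xml` declaration (optionally preceded by a UTF-8 byte order mark), as in the header shown in the question.

```python
# Well-known file signatures.
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # OLE2 compound file (binary .xls)
ZIP_MAGIC = b"PK\x03\x04"                          # ZIP archive (.xlsx)
UTF8_BOM = b"\xef\xbb\xbf"


def sniff_spreadsheet_format(head):
    """Classify a file from its leading bytes.

    The labels 'biff', 'zip', 'xml' and 'unknown' are hypothetical names
    chosen for this sketch.
    """
    if head.startswith(OLE2_MAGIC):
        return "biff"
    if head.startswith(ZIP_MAGIC):
        return "zip"
    if head.startswith(UTF8_BOM):
        head = head[len(UTF8_BOM):]
    if head.lstrip().startswith(b"<?xml"):
        return "xml"
    return "unknown"


def sniff_file(path):
    """Read the first bytes of the file at `path` and classify them."""
    with open(path, "rb") as fh:
        return sniff_spreadsheet_format(fh.read(64))
```

Files reported as "xml" are the ones xlrd rejects with the "Expected BOF record" error; only those reported as "biff" are in the binary format it reads.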

If there's no Python library available and you have good prior knowledge of the structure of the files you need to process, I'd just use an XML reader.
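As a rough sketch of the XML-reader approach, the standard library's `xml.etree.ElementTree` can walk the `Worksheet`, `Row`, `Cell` and `Data` elements of a SpreadsheetML file. This assumes the files use the `urn:schemas-microsoft-com:office:spreadsheet` namespace shown in the question's header; the function name `read_spreadsheetml` and the `SAMPLE` workbook are made up for illustration. Cell values come back as text, and merged cells and the `ss:Index` attribute on rows are not handled.

```python
import io
import xml.etree.ElementTree as ET

# Namespace taken from the <Workbook> header shown in the question.
SS = "urn:schemas-microsoft-com:office:spreadsheet"

# Made-up sample data standing in for a SAS-generated file.
SAMPLE = b"""<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
<Worksheet ss:Name="Table 1">
<Table>
<Row><Cell><Data ss:Type="String">name</Data></Cell><Cell><Data ss:Type="String">value</Data></Cell></Row>
<Row><Cell><Data ss:Type="String">a</Data></Cell><Cell ss:Index="3"><Data ss:Type="Number">1.5</Data></Cell></Row>
</Table>
</Worksheet>
</Workbook>"""


def read_spreadsheetml(source):
    """Return {sheet name: list of rows}, each row a list of cell texts.

    `source` is a file name or a binary file object, as accepted by ET.parse.
    """
    root = ET.parse(source).getroot()
    sheets = {}
    for worksheet in root.iter(f"{{{SS}}}Worksheet"):
        rows = []
        for row in worksheet.iter(f"{{{SS}}}Row"):
            values = []
            for cell in row.findall(f"{{{SS}}}Cell"):
                # ss:Index (1-based) marks a jump over empty columns.
                index = cell.get(f"{{{SS}}}Index")
                if index is not None:
                    while len(values) < int(index) - 1:
                        values.append(None)
                data = cell.find(f"{{{SS}}}Data")
                values.append("".join(data.itertext()) if data is not None else None)
            rows.append(values)
        sheets[worksheet.get(f"{{{SS}}}Name")] = rows
    return sheets


sheets = read_spreadsheetml(io.BytesIO(SAMPLE))
```

For a real file, `read_spreadsheetml('out_1.xls')` would be the equivalent call. Each sheet's row list can then be written out with the `csv` module or passed to `pandas.DataFrame` if pandas is available.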

Regards, Dave


