
Force save an xml file to xls format in Python

I have this code, which downloads this fund's data in Excel 2004 XML format:

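# Python 3; on Python 2 the equivalent module is urllib2
from urllib.request import urlopen

url = 'https://www.ishares.com/us/258100/fund-download.dl'
contents = urlopen(url).read()
# The response body is bytes, so write it in binary mode
with open("export.xml", 'wb') as f:
    f.write(contents)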

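My goal is to convert this file to .xls programmatically, so that I can then read it into a pandas DataFrame. I know I could parse the file with Python's xml library; however, I noticed that if I open the xml file in Excel and manually save it with the xls file extension, pandas can read it and I get the result I want.

I also tried renaming the file extension with the code below, but that does not "force" a save: the file is still the underlying xml document, just with an xls file extension.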

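import os

# Expand "~" explicitly; os.listdir does not do it
folder = os.path.expanduser('~/models')
for filename in os.listdir(folder):
    if filename.startswith('export'):
        # Swap only the extension; the file content stays XML
        newname = os.path.splitext(filename)[0] + '.xls'
        # os.listdir returns bare names, so join them with the folder
        os.rename(os.path.join(folder, filename),
                  os.path.join(folder, newname))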
https://www.ishares.com/us/258100/fund-download.dl
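Renaming cannot work because readers identify a spreadsheet by its leading bytes, not by its extension. The following is a minimal sketch of such a check, not taken from the original post: the function name `sniff_spreadsheet_format` is hypothetical, and it only distinguishes the legacy binary container, the zip container, and XML with an optional UTF-8 byte order mark.

```python
def sniff_spreadsheet_format(head: bytes) -> str:
    """Guess a spreadsheet container format from the first bytes of a file."""
    # OLE2 compound file signature: legacy binary .xls
    if head.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "ole2"
    # Zip local file header: .xlsx and other zipped Office formats
    if head.startswith(b"PK\x03\x04"):
        return "zip"
    # Drop an optional UTF-8 byte order mark before looking for XML
    text = head[3:] if head.startswith(b"\xef\xbb\xbf") else head
    if text.lstrip().startswith(b"<?xml"):
        return "xml"
    return "unknown"


print(sniff_spreadsheet_format(b"\xef\xbb\xbf<?xml version='1.0'?>"))
print(sniff_spreadsheet_format(b"PK\x03\x04rest-of-archive"))
```

For a file on disk, the first 16 bytes read in binary mode are enough to pass in. A result of "xml" for a file named .xls means the extension was changed but the content was never converted.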

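For Excel on Windows, consider using the win32com module to connect to the Excel object library from Python through COM. Specifically, use Excel's Workbooks.OpenXML method followed by SaveAs to save the downloaded xml as csv: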

import os
import win32com.client as win32    
import requests as r
import pandas as pd

cd = os.path.dirname(os.path.abspath(__file__))

url = "http://www.ishares.com/us/258100/fund-download.dl"
xmlfile = os.path.join(cd, 'iSharesDownload.xml')
csvfile = os.path.join(cd, 'iSharesDownload.csv')

# DOWNLOAD FILE
try:
    rqpage = r.get(url)
    with open(xmlfile, 'wb') as f:
        f.write(rqpage.content)    
except Exception as e:
    print(e)    
finally:
    rqpage = None

# EXCEL COM TO SAVE EXCEL XML AS CSV
if os.path.exists(csvfile):
    os.remove(csvfile)
try:
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    wb = excel.Workbooks.OpenXML(xmlfile)
    wb.SaveAs(csvfile, 6)
    wb.Close(True)    
except Exception as e:
    print(e)    
finally:
    # RELEASES RESOURCES
    wb = None
    excel = None

# IMPORT CSV INTO PANDAS DATAFRAME
df = pd.read_csv(csvfile, skiprows=8)
print(df.describe())

#        Weight (%)       Price  Coupon (%)     YTM (%)  Yield to Worst (%)    Duration
# count  625.000000  625.000000  625.000000  625.000000          625.000000  625.000000
# mean     0.159888  101.298768    6.500256    5.881168            5.313760    2.128688
# std      0.126833   10.469460    1.932744    4.059226            4.224268    1.283360
# min     -0.110000    0.000000    0.000000    0.000000           -8.030000    0.000000
# 25%      0.090000  100.380000    5.130000    3.430000            3.070000    0.970000
# 50%      0.130000  102.940000    6.380000    4.930000            3.910000    2.240000
# 75%      0.190000  105.000000    7.630000    6.820000            6.070000    3.260000
# max      1.750000  128.750000   12.500000   40.900000           40.900000    5.060000
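The skiprows=8 value is specific to the layout of this particular download. As a hedged alternative, the sketch below locates the header row by searching for a known column label instead of hard-coding the offset. The helper name `find_header_row` and the sample lines are hypothetical; only the "Weight (%)" label comes from the output above.

```python
def find_header_row(lines, marker):
    """Return the index of the first line containing `marker`, or None."""
    for i, line in enumerate(lines):
        if marker in line:
            return i
    return None


# Hypothetical stand-in for the first lines of the exported csv
sample = [
    "Fund name (hypothetical preamble)",
    "As of (hypothetical date line)",
    "",
    "Name,Weight (%),Price",
    "Bond A,0.13,102.94",
]

skip = find_header_row(sample, "Weight (%)")
print(skip)  # 3
```

The returned index can be passed as skiprows to pd.read_csv, assuming the preamble lines do not themselves contain the chosen label.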

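For Excel for Mac, consider a VBA solution, since VBA is the most common language for interfacing with the Excel object library. The macro below downloads the iShares xml with OpenXML and then uses SaveAs to write it as csv, ready for import into pandas.

Note: this is untested on Mac, but hopefully the Microsoft.XMLHTTP object is available there.

VBA (save in a macro-enabled workbook)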

Option Explicit

Sub DownloadXML()
On Error GoTo ErrHandle
    Dim wb As Workbook
    Dim xmlDoc As Object
    Dim xmlfile As String, csvfile As String

    xmlfile = ActiveWorkbook.Path & "\file.xml"
    csvfile = ActiveWorkbook.Path & "\file.csv"

    Call DownloadFile("https://www.ishares.com/us/258100/fund-download.dl", xmlfile)

    Set wb = Excel.Workbooks.OpenXML(xmlfile)

    wb.SaveAs csvfile, 6
    wb.Close True

ExitHandle:
    Set wb = Nothing
    Set xmlDoc = Nothing
    Exit Sub

ErrHandle:
    MsgBox Err.Number & " - " & Err.Description, vbCritical
    Resume ExitHandle
End Sub

Function DownloadFile(url As String, filePath As String)
    Dim WinHttpReq As Object, oStream As Object

    Set WinHttpReq = CreateObject("Microsoft.XMLHTTP")
    WinHttpReq.Open "GET", url, False
    WinHttpReq.send

    If WinHttpReq.Status = 200 Then
        Set oStream = CreateObject("ADODB.Stream")
        oStream.Open
        oStream.Type = 1
        oStream.Write WinHttpReq.responseBody
        oStream.SaveToFile filePath, 2 ' 1 = no overwrite, 2 = overwrite
        oStream.Close
    End If

    Set WinHttpReq = Nothing
    Set oStream = Nothing
End Function

Python

import pandas as pd

csvfile = "/path/to/file.csv"

# IMPORT CSV INTO PANDAS DATAFRAME
df = pd.read_csv(csvfile, skiprows=8)
print(df.describe())

#        Weight (%)       Price  Coupon (%)     YTM (%)  Yield to Worst (%)    Duration
# count  625.000000  625.000000  625.000000  625.000000          625.000000  625.000000
# mean     0.159888  101.298768    6.500256    5.881168            5.313760    2.128688
# std      0.126833   10.469460    1.932744    4.059226            4.224268    1.283360
# min     -0.110000    0.000000    0.000000    0.000000           -8.030000    0.000000
# 25%      0.090000  100.380000    5.130000    3.430000            3.070000    0.970000
# 50%      0.130000  102.940000    6.380000    4.930000            3.910000    2.240000
# 75%      0.190000  105.000000    7.630000    6.820000            6.070000    3.260000
# max      1.750000  128.750000   12.500000   40.900000           40.900000    5.060000

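I found that the site I was working with had developed an API, which makes it possible to avoid web scraping altogether. I then used Python's requests module.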

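import requests

url = "https://www.blackrock.com/tools/hackathon/performance"
tickers = []  # placeholder: fill in the ticker symbols to query

for ticker in tickers:
    params = {'identifiers': ticker,
              'returnsType': 'MONTHLY'}
    response = requests.get(url, params=params)
    data = response.json()  # parsed JSON payload for this ticker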

read xls file in pandas / python: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'

I am trying to open an xls file (with only one tab) into a pandas dataframe.

It is a file that I can normally read in Excel or Excel for the web; in fact, here is the raw file itself: https://www.dropbox.com/scl/fi/zbxg8ymjp8zxo6k4an4dj/product-screener.xls?dl=0&rlkey=3aw7whab78jeexbdkthkjzkmu

I notice that the top two rows have merged cells, and so do some of the columns.

I have tried several methods (from Stack Overflow), and all of them failed.

# method 1 - read excel
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file)
print(df)

Error: Excel file format cannot be determined, you must specify an engine manually.

# method 2 - pip install xlrd and use engine
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_excel(file, engine='xlrd')
print(df)

Error: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'

# method 3 - rename to xlsx and open with openpyxl
file = "C:\\Users\\admin\\Downloads\\product-screener.xlsx"
df = pd.read_excel(file, engine='openpyxl')
print(df)

Error: File is not a zip file (converting, rather than renaming, would be an option).

# method 4 - use read_xml
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_xml(file)
print(df)

This method does produce a result, but the DataFrame bears no meaningful relation to the worksheet. Presumably the xml needs to be interpreted (which seems complicated)?

   Style       Name  Table
0    NaN       None    NaN
1    NaN  All funds    NaN

# method 5 - use read_table
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_table(file)
print(df)

This method reads the file into a one-column (series) DataFrame. So how can this information be used to create a standard 2d DataFrame with the same shape as the xls file?

0       <Workbook xmlns="urn:schemas-microsoft-com:off...
1       <Styles>
2       <Style ss:ID="Default">
3       <Alignment Horizontal="Left"/>
4       </Style>
...     ...
226532  </Cell>
226533  </Row>
226534  </Table>
226535  </Worksheet>
226536  </Workbook>

# method 6 - use read_html
file = "C:\\Users\\admin\\Downloads\\product-screener.xls"
df = pd.read_html(file)
print(df)

This returns an empty list [], whereas one might expect at least a list of DataFrames.

So the question is: what is the simplest way to read this file into a dataframe (or a similar usable format)?
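The `<Workbook xmlns="urn:schemas-microsoft-com:off...` line in the read_table output suggests the file is an Excel 2003 XML spreadsheet, which the standard library can walk directly. The following is a minimal sketch under that assumption, not a tested solution for the linked file: the function name `spreadsheetml_rows` and the sample document are hypothetical, and merged cells (ss:MergeAcross, ss:MergeDown) are not expanded.

```python
import xml.etree.ElementTree as ET

SS = "urn:schemas-microsoft-com:office:spreadsheet"
NS = {"ss": SS}


def spreadsheetml_rows(xml_data, sheet_index=0):
    """Return the cell text of one worksheet as a list of rows."""
    root = ET.fromstring(xml_data)
    sheet = root.findall("ss:Worksheet", NS)[sheet_index]
    rows = []
    for row in sheet.iterfind("ss:Table/ss:Row", NS):
        values = []
        for cell in row.iterfind("ss:Cell", NS):
            # ss:Index is 1-based and marks a jump over empty cells
            index = cell.get(f"{{{SS}}}Index")
            if index is not None:
                values.extend([None] * (int(index) - 1 - len(values)))
            data = cell.find("ss:Data", NS)
            values.append(data.text if data is not None else None)
        rows.append(values)
    return rows


# Hypothetical sample in the same namespace as the file in the question
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Sheet1">
    <Table>
      <Row>
        <Cell><Data ss:Type="String">Name</Data></Cell>
        <Cell><Data ss:Type="String">Coupon</Data></Cell>
        <Cell><Data ss:Type="String">Price</Data></Cell>
      </Row>
      <Row>
        <Cell><Data ss:Type="String">A</Data></Cell>
        <Cell ss:Index="3"><Data ss:Type="Number">1.5</Data></Cell>
      </Row>
    </Table>
  </Worksheet>
</Workbook>"""

rows = spreadsheetml_rows(SAMPLE)
print(rows)
```

For a real file, reading it in binary mode and passing the bytes lets the parser deal with the encoding itself. The resulting rows can then be handed to pandas, for example with pd.DataFrame(rows[1:], columns=rows[0]) when the first row holds the headers; all values arrive as strings and still need type conversion.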


