Is it possible in Python to convert the below strange .XLS file, which is actually in some HTML/XML format, to .XLSX?
Force save an xml file to xls format in Python
I have this code here, which downloads this fund data in Excel 2004 XML format:
import urllib2
url = 'https://www.ishares.com/us/258100/fund-download.dl'
s = urllib2.urlopen(url)
contents = s.read()
file = open("export.xml", 'w')
file.write(contents)
file.close()
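My goal is to convert this file to .xls programmatically so that I can then read it into a pandas DataFrame. I know I could parse the file with Python's xml library. However, I noticed that if I open the xml file and manually save it with the xls file extension, pandas can read it and I get the result I want.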
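For reference, the xml-library route mentioned above could look roughly like the sketch below. It assumes the download is a well-formed Excel 2003/2004 XML Spreadsheet (Workbook / Worksheet / Table / Row / Cell / Data elements in the `urn:schemas-microsoft-com:office:spreadsheet` namespace). The function name `rows_from_spreadsheetml` and the tiny `SAMPLE` document are made up for illustration, and the sketch ignores the `ss:Index` attribute that such files may use to skip cells.

```python
import xml.etree.ElementTree as ET

# Namespace used by Excel 2003/2004 XML Spreadsheet files.
NS = {"ss": "urn:schemas-microsoft-com:office:spreadsheet"}


def rows_from_spreadsheetml(xml_text, sheet_index=0):
    """Return one worksheet as a list of rows, each a list of cell strings."""
    root = ET.fromstring(xml_text)
    sheets = root.findall("ss:Worksheet", NS)
    rows = []
    for row in sheets[sheet_index].findall("ss:Table/ss:Row", NS):
        values = []
        for cell in row.findall("ss:Cell", NS):
            data = cell.find("ss:Data", NS)
            # Cells without a <Data> child (or with empty text) become "".
            values.append((data.text or "") if data is not None else "")
        rows.append(values)
    return rows


# Hand-written sample standing in for the downloaded file (hypothetical data).
SAMPLE = """<?xml version="1.0"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
          xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet">
  <Worksheet ss:Name="Holdings">
    <Table>
      <Row><Cell><Data ss:Type="String">Name</Data></Cell><Cell><Data ss:Type="String">Price</Data></Cell></Row>
      <Row><Cell><Data ss:Type="String">Bond A</Data></Cell><Cell><Data ss:Type="Number">101.5</Data></Cell></Row>
    </Table>
  </Worksheet>
</Workbook>"""

rows = rows_from_spreadsheetml(SAMPLE)
print(rows)
```

The resulting list could then be handed to pandas, for example `pd.DataFrame(rows[1:], columns=rows[0])`, once the leading metadata rows of the real file are dropped.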
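I also tried renaming the file extension with the code below, but that approach does not "force" a save: the file is still the underlying xml document, just with an xls file extension.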
import os
import sys
folder = '~/models'
for filename in os.listdir(folder):
    if filename.startswith('export'):
        infilename = filename
        newname = infilename.replace('newfile.xls', 'f.xls')
        output = os.rename(infilename, newname)
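Renaming only changes the name, not the bytes inside the file. One way to confirm what a file really contains is to look at its first few bytes. The sketch below is only an illustration: the helper name `sniff_spreadsheet_format` is made up, and it checks just three well-known signatures (the OLE2 header used by binary .xls, the ZIP header used by .xlsx, and a leading `<` for XML/HTML).

```python
def sniff_spreadsheet_format(first_bytes):
    """Guess the real container format from the first bytes of a file."""
    # OLE2 compound file header: binary Excel 97-2003 (.xls)
    if first_bytes.startswith(b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"):
        return "xls"
    # ZIP local file header: .xlsx files are ZIP archives
    if first_bytes.startswith(b"PK\x03\x04"):
        return "xlsx"
    # Skip an optional UTF-8 byte order mark and leading whitespace
    head = first_bytes.lstrip(b"\xef\xbb\xbf \t\r\n")
    if head.startswith(b"<"):
        return "xml"
    return "unknown"


print(sniff_spreadsheet_format(b'<?xml version="1.0"?>'))
```

Usage on a real file would be along the lines of `with open(path, 'rb') as f: sniff_spreadsheet_format(f.read(16))`. A renamed export would still report `xml`.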
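For Excel on Windows, consider using the win32com module to have Python connect to the Excel object library over COM. Specifically, use Excel's Workbooks.OpenXML and SaveAs methods to save the downloaded xml as csv: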
import os
import win32com.client as win32
import requests as r
import pandas as pd
cd = os.path.dirname(os.path.abspath(__file__))
url = "http://www.ishares.com/us/258100/fund-download.dl"
xmlfile = os.path.join(cd, 'iSharesDownload.xml')
csvfile = os.path.join(cd, 'iSharesDownload.csv')
# DOWNLOAD FILE
try:
    rqpage = r.get(url)
    with open(xmlfile, 'wb') as f:
        f.write(rqpage.content)
except Exception as e:
    print(e)
finally:
    rqpage = None
# EXCEL COM TO SAVE EXCEL XML AS CSV
if os.path.exists(csvfile):
    os.remove(csvfile)
try:
    excel = win32.gencache.EnsureDispatch('Excel.Application')
    wb = excel.Workbooks.OpenXML(xmlfile)
    wb.SaveAs(csvfile, 6)
    wb.Close(True)
except Exception as e:
    print(e)
finally:
    # RELEASES RESOURCES
    wb = None
    excel = None
# IMPORT CSV INTO PANDAS DATAFRAME
df = pd.read_csv(csvfile, skiprows=8)
print(df.describe())
# Weight (%) Price Coupon (%) YTM (%) Yield to Worst (%) Duration
# count 625.000000 625.000000 625.000000 625.000000 625.000000 625.000000
# mean 0.159888 101.298768 6.500256 5.881168 5.313760 2.128688
# std 0.126833 10.469460 1.932744 4.059226 4.224268 1.283360
# min -0.110000 0.000000 0.000000 0.000000 -8.030000 0.000000
# 25% 0.090000 100.380000 5.130000 3.430000 3.070000 0.970000
# 50% 0.130000 102.940000 6.380000 4.930000 3.910000 2.240000
# 75% 0.190000 105.000000 7.630000 6.820000 6.070000 3.260000
# max 1.750000 128.750000 12.500000 40.900000 40.900000 5.060000
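The `skiprows=8` above is tied to the current layout of the export. If the number of metadata rows ever changes, one option is to locate the header row first. The sketch below is only an illustration using the standard library: the helper `rows_before_header`, the sample text, and the header cell `"Name"` are all assumptions, not taken from the actual file, and it assumes no quoted field in the preamble spans several lines.

```python
import csv
import io


def rows_before_header(csv_text, header_cell):
    """Return how many rows precede the row whose first cell equals header_cell."""
    for index, row in enumerate(csv.reader(io.StringIO(csv_text))):
        if row and row[0].strip() == header_cell:
            return index
    raise ValueError("header row not found: %r" % header_cell)


# Made-up sample: two metadata rows, a blank row, then the real header.
sample = (
    "Fund name,Example Fund\n"
    "As of,01-Jan-2016\n"
    "\n"
    "Name,Weight (%),Price\n"
    "Bond A,0.13,102.94\n"
)

skip = rows_before_header(sample, "Name")
print(skip)
```

The returned count could then replace the hard-coded value, as in `pd.read_csv(csvfile, skiprows=skip)`.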
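For Excel for Mac, consider a VBA solution, since VBA is the most common language for interfacing with the Excel object library. The macro below downloads the iShares xml, then uses the OpenXML and SaveAs methods to save it as csv for import into pandas.
Note: this is untested on Mac, but hopefully the Microsoft.XMLHTTP object is available.
VBA (save in a macro-enabled workbook)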
Option Explicit
Sub DownloadXML()
    On Error GoTo ErrHandle
    Dim wb As Workbook
    Dim xmlDoc As Object
    Dim xmlfile As String, csvfile As String
    xmlfile = ActiveWorkbook.Path & "\file.xml"
    csvfile = ActiveWorkbook.Path & "\file.csv"
    Call DownloadFile("https://www.ishares.com/us/258100/fund-download.dl", xmlfile)
    Set wb = Excel.Workbooks.OpenXML(xmlfile)
    wb.SaveAs csvfile, 6
    wb.Close True
ExitHandle:
    Set wb = Nothing
    Set xmlDoc = Nothing
    Exit Sub
ErrHandle:
    MsgBox Err.Number & " - " & Err.Description, vbCritical
    Resume ExitHandle
End Sub
Function DownloadFile(url As String, filePath As String)
    Dim WinHttpReq As Object, oStream As Object
    Set WinHttpReq = CreateObject("Microsoft.XMLHTTP")
    WinHttpReq.Open "GET", url, False
    WinHttpReq.send
    If WinHttpReq.Status = 200 Then
        Set oStream = CreateObject("ADODB.Stream")
        oStream.Open
        oStream.Type = 1
        oStream.Write WinHttpReq.responseBody
        oStream.SaveToFile filePath, 2 ' 1 = no overwrite, 2 = overwrite
        oStream.Close
    End If
    Set WinHttpReq = Nothing
    Set oStream = Nothing
End Function
Python
import pandas as pd
csvfile = "/path/to/file.csv"
# IMPORT CSV INTO PANDAS DATAFRAME
df = pd.read_csv(csvfile, skiprows=8)
print(df.describe())
# Weight (%) Price Coupon (%) YTM (%) Yield to Worst (%) Duration
# count 625.000000 625.000000 625.000000 625.000000 625.000000 625.000000
# mean 0.159888 101.298768 6.500256 5.881168 5.313760 2.128688
# std 0.126833 10.469460 1.932744 4.059226 4.224268 1.283360
# min -0.110000 0.000000 0.000000 0.000000 -8.030000 0.000000
# 25% 0.090000 100.380000 5.130000 3.430000 3.070000 0.970000
# 50% 0.130000 102.940000 6.380000 4.930000 3.910000 2.240000
# 75% 0.190000 105.000000 7.630000 6.820000 6.070000 3.260000
# max 1.750000 128.750000 12.500000 40.900000 40.900000 5.060000
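I found that the website I was working with has developed an API, which makes it possible to avoid web scraping altogether. From there, you can use Python's requests module.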
import requests
url = "https://www.blackrock.com/tools/hackathon/performance"
# tickers is assumed to be a list of ticker strings defined elsewhere
for ticker in tickers:
    params = {'identifiers': ticker,
              'returnsType': 'MONTHLY'}
    request = requests.get(url, params=params)
    json = request.json()
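To see what each request looks like on the wire, the params dict ends up as a query string appended to the URL. The standard-library sketch below builds that URL by hand, which for simple values like these should match what requests sends. The ticker `"AGG"` is only a placeholder for illustration.

```python
from urllib.parse import urlencode

base_url = "https://www.blackrock.com/tools/hackathon/performance"
# "AGG" is a placeholder ticker, not taken from the original post.
params = {"identifiers": "AGG", "returnsType": "MONTHLY"}

# Join the base URL and the encoded parameters into one request URL.
full_url = base_url + "?" + urlencode(params)
print(full_url)
```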