简体   繁体   中英

XML to XLSX in Python

I've searched high and low for an answer and there doesn't seem to be a definitive solution. Here goes:

from selenium import webdriver

chromedriver_path = ("localchromedrive/chromedriver.exe")
chromeOptions = webdriver.ChromeOptions()
MSCI_dir = ("mylocaldrive")
prefs = {"download.default_directory" : MSCI_dir}
chromeOptions.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(chromedriver_path,chrome_options=chromeOptions)
url = "https://www.ishares.com/us/239637/fund-download.dl"
driver.get(url)

The file is now downloaded in a local path and saved as the following:

temp_path = "mylocaldrive\iShares-MSCI-Emerging-Markets-ETF_fund.xls"

This file is saved as an ".xls" file type but it is clearly an XML file. See below for the file opened up in NotePad. 在此处输入图片说明

I've tried xlrd:

import xlrd
book = xlrd.open_workbook(temp_path)
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'

I've tried xml.etree:

import xml.etree.ElementTree as ET
tree = ET.parse(temp_path)
File "<string>", line unknown
ParseError: mismatched tag: line 16, column 2`

I've tried xlwings:

wb = xw.Book(temp_path)
wb.save(xlsx_path)
wb.close()`

which looks like it works, but when I try and use pandas I get this:

pd.read_excel(xlsx_path)
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbf<?xml'`

I've tried BeautifulSoup

from bs4 import BeautifulSoup`
soup = BeautifulSoup(open(temp_path), "xml")`

In [1]: soup
Out[1]: <?xml version="1.0" encoding="utf-8"?>`

In [2]: soup.contents
Out[2]: []`

In [3]: soup.get_text()
Out[3]: ''`

I'm looking for the definitive way to access this file with pandas. Let me know what info you need from me that I'm missing.

I think that your problem is that the file is not a XLS but a XLSX file that is a special XML file made by Microsoft to reduce the DOC and XLS files size.

Look: https://en.wikipedia.org/wiki/Microsoft_Office_XML_formats

https://msdn.microsoft.com/en-us/library/dd922181(v=office.12).aspx

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM