简体   繁体   中英

How do I read an Excel file into Python using xlrd? Can it read newer Office formats?

My issue is below but would be interested comments from anyone with experience with xlrd.

I just found xlrd and it looks like the perfect solution but I'm having a little problem getting started. I am attempting to extract data programatically from an Excel file I pulled from Dow Jones with current components of the Dow Jones Industrial Average (link: http://www.djindexes.com/mdsidx/?event=showAverages )

When I open the file unmodified I get a nasty BIFF error (binary format not recognized)

However you can see in this screenshot that Excel 2008 for Mac thinks it is in 'Excel 1997-2004' format (screenshot: http://skitch.com/alok/ssa3/componentreport-dji.xls-properties )

If I instead open it in Excel manually and save as 'Excel 1997-2004' format explicitly, then open in python usig xlrd, everything is wonderful. Remember, Office thinks the file is already in 'Excel 1997-2004' format. All files are .xls

Here is a pastebin of an ipython session replicating the issue: http://pastie.textmate.org/private/jbawdtrvlrruh88mzueqdq

Any thoughts on: How to trick xlrd into recognizing the file so I can extract data? How to use python to automate the explicit 'save as' format to one that xlrd will accept? Plan B?

FWIW, I'm the author of xlrd, and the maintainer of xlwt (a fork of pyExcelerator). A few points:

  1. The file ComponentReport-DJI.xls is misnamed; it is not an XLS file, it is a tab-separated-values file. Open it with a text editor (eg Notepad) and you'll see what I mean. You can also look at the not-very-raw raw bytes with Python:

     >>> open('ComponentReport-DJI.xls', 'rb').read(200) 'COMPANY NAME\\tPRIMARY EXCHANGE\\tTICKER\\tSTYLE\\tICB SUBSECTOR\\tMARKET CAP RANGE\\ tWEIGHT PCT\\tUSD CLOSE\\t\\r\\n3M Co.\\tNew York SE\\tMMM\\tN/A\\tDiversified Industria ls\\tBroad\\t5.15676229508\\t50.33\\t\\r\\nAlcoa Inc.\\tNew York SE\\tA' 

    You can read this file using Python's csv module ... just use delimiter="\\t" in your call to csv.reader() .

  2. xlrd can read any file that pyExcelerator can, and read them better—dates don't come out as floats, and the full story on Excel dates is in the xlrd documentation.

  3. pyExcelerator is abandonware—xlrd and xlwt are alive and well. Check out http://groups.google.com/group/python-excel

HTH John

对Office 2007/2008(OpenXML)格式的xlrd支持是alpha测试 - 请参阅python-excel新闻组中的以下帖子: http ://groups.google.com/group/python-excel/msg/0c5f15ad122bf24b?hl = en

More info on pyExcelerator: To read a file, do this:

import pyExcelerator
book = pyExcelerator.parse_xls(filename)

where filename is a string that is the filename to read (not a file-like object). This will give you a data structure representing the workbook: a list of pairs, where the first element of the pair is the worksheet name and the second element is the worksheet data.

The worksheet data is a dictionary, where the keys are (row, col) pairs (starting with 0) and the values are the cell contents -- generally int, float, or string. So, for instance, in the simple case of all the data being on the first worksheet:

data = book[0][1]
print 'Cell A1 of worksheet %s is: %s' % (book[0][0], repr(data[(0, 0)]))

If the cell is empty, you'll get a KeyError. If you're dealing with dates, they may (I forget) come through as integers or floats; if this is the case, you'll need to convert. Basically the rule is: datetime.datetime(1899, 12, 31) + datetime.timedelta(days=n) but that might be off by 1 or 2 (because Excel treats 1900 as a leap-year for compatibility with Lotus, and because I can't remember if 1900-1-1 is 0 or 1), so do some trial-and-error to check. Datetimes are stored as floats, I think (days and fractions of a day).

I think there is partial support for forumulas, but I wouldn't guarantee anything.

Well here is some code that I did: (look down the bottom): here

Not sure about the newer formats - if xlrd can't read it, xlrd needs to have a new version released !

Do you have to use xlrd? I just downloaded 'UPDATED - Dow Jones Industrial Average Movers - 2008' from that website and had no trouble reading it with pyExcelerator .

import pyExcelerator
book = pyExcelerator.parse_xls('DJIAMovers.xls')

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM