简体   繁体   中英

Read manually inserted text from an Excel spreadsheet

I have a .xlsx file containing my university's timetable. I'm working on an application that makes use of the timetable. But I don't want to "copy" the timetable contents from this Excel spreadsheet into a more "programmer-friendly" format, instead, I'd like to write a program/script that would parse this .xlsx table and automatically convert it in the format I need (eg in some objects in code).

There's no trouble for me in reading "normal" cells of the spreadsheet. However, instead of simply putting 1 text entry in each cell, the person who created this timetable file manually "divided" some cells into "subcells" and manually inserted some text in each of them. This looks like: 图。1

How should this be interpreted: students are divided into 4 groups. At 15.20-16.50 only groups number 1 and 2 will have a specific class. At 17.00-18.30 only groups 1, 3, and 4 will have that class.

As one can see, these "cells" are not real cells — they seem to have been created ("divided") manually, just like the text that is selected in the picture.

The question is: how do I find and read such "cells" (manually inserted text components) like in the picture (preferably also knowing their position so that I can not only read what classes exist, but also when they start (time is stated in the very left of the spreadsheet)) ?

I tried using Python 's xlrd module but haven't been able to achieve what I need. Neither have I had any success with Java 's Apache POI — I just can't find how to read such text entries. Solutions on both languages, no matter what libraries and approaches are used, will be fine for me.

Both xls and xslx are proprietary formats. Microsoft went out of their way to explain in court that xslx is open, but unfortunately not one of the judges involved knew anything significant about computer science and the lawyers knew it, so don't get distracted by their misleading case. XSLX has the option for the 'vendor' to add a block of 'custom binary blobs' and the vast majority of the excel features that aren't the most common, lowest level stuff imaginable are in these binary blobs. No doubt this 'stick a text table object into a single cell' thing that's going on here is exactly like that.

Microsoft has never released any documentation on these binary blobs, nor any library that can parse them.

Therefore, Apache POI, xlrd, and all other libraries to read XLS files that do not explicitly require Excel to be installed and running on the computer that's running the 'library' (kind of a tricky thing to pull if you have eg a linux-based server!) are based on reverse engineering it, and it's a horrible format. Literally - look up what Apache POI's 'HSSF' stands for. Officially nothing, but etymologically, that H is for Horrible. (Horrible Spread Sheet Format - HSSF).

That's the long way around of saying: Sorry - you probably can't. And it's not the fault of POI or xlrd, it's on microsoft. It is not appropriate to use such a closed, proprietary and undocumented format to transfer anything meaningful. The error lies in whatever process led to the situation that you're now stuck trying to write software to parse a weird excel file.

If you must, most likely a script running within excel can untangle this mess and write out a csv file or json or something in a documented format. Alternatively, you can write something in C#, but it would just be farming out the work to excel, so, you still would not be able to port this code to other platforms.

Apache POI does give you the option of a more low-level approach where you can read the binary blobs. You can attempt to reverse engineer whatever's going on in that 'cell-with-a-table-in-it' yourself, but as neither the xlrd team nor the Apache POI team has bothered, and at least the POI team is on record as saying the format seems to be designed to be obfuscated - that sounds like a job that will take you many, many weeks.

That gets me back to the solution I advised earlier: Unless spending many weeks building an incredibly fragile stack that requires a full blown windows and an excel license is the lesser evil compared to a simple change in human behaviour (unlikely), the fix lies in addressing the process (as in, address that excel is used to transfer this info, or at least make the excel sheet muuuch simpler than this thing), and not by finding out how to read this mess in java or python.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM