
Extracting data from Excel Pivot Table Spreadsheet in Linux

I have an Excel spreadsheet based on a pivot table that is updated monthly and uploaded to my server. It is generated by a group that is very hesitant to change anything in the output. I would like to write a script, run via a cron job, that processes the raw data from the pivot table and loads it into my database.

However, I can't figure out how to get at the underlying data without manually going into Windows, opening the file in Excel, double-clicking the totals cell to get a new sheet with all the raw data that went into populating that cell, and saving that sheet as a CSV that I can then load into my database via some language (in my case Python). It seems like there should be some scriptable way to extract the underlying data.

I only have Linux machines (I run Windows/Office in a VM, but I'd prefer a solution that doesn't involve Windows). I am familiar with tools like xls2csv (which doesn't access the raw data) and with using python-unoconv to edit OpenOffice documents from Python. However, even manually using OpenOffice I don't see a way to get at the underlying data.

EDIT: After spending a good few hours not making any progress (prior to posting this), I'm now starting to make some by converting the file to ODS via unoconv, and I will likely be able to use something like python-odf to extract the last sheet (called 'DPCache').

So now the problem is to convert one sheet of an ODS file into a CSV, which shouldn't be too hard for me to figure out (though help is greatly appreciated).
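
One way that last step might look, using only the Python standard library: an ODS file is a zip archive whose sheets live in content.xml. This is a rough sketch, not a tested converter. The function names ods_sheet_rows and ods_sheet_to_csv are made up for illustration, and the code assumes that numeric cells carry their raw value in the office:value attribute, that padding cells can be dropped, and that the whole content.xml fits in memory.

```python
import csv
import zipfile
import xml.etree.ElementTree as ET

# OpenDocument namespaces used in content.xml
OFFICE = '{urn:oasis:names:tc:opendocument:xmlns:office:1.0}'
TABLE = '{urn:oasis:names:tc:opendocument:xmlns:table:1.0}'
TEXT = '{urn:oasis:names:tc:opendocument:xmlns:text:1.0}'


def ods_sheet_rows(ods_file, sheet_name):
    """Return the non-empty rows of one sheet as lists of strings."""
    with zipfile.ZipFile(ods_file) as zf:
        root = ET.fromstring(zf.read('content.xml'))
    for table in root.iter(TABLE + 'table'):
        if table.get(TABLE + 'name') == sheet_name:
            break
    else:
        raise KeyError('no sheet named %r' % sheet_name)

    rows = []
    for row in table.iter(TABLE + 'table-row'):
        values = []
        for cell in row.findall(TABLE + 'table-cell'):
            # Numeric cells carry the raw number in office:value; other
            # cells fall back to their displayed text.
            value = cell.get(OFFICE + 'value')
            if value is None:
                value = '\n'.join(
                    ''.join(p.itertext()) for p in cell.findall(TEXT + 'p'))
            # Runs of identical cells are stored once with a repeat count.
            repeat = int(cell.get(TABLE + 'number-columns-repeated', '1'))
            values.extend([value] * repeat)
        while values and values[-1] == '':
            values.pop()  # drop trailing padding cells
        if values:  # completely empty rows are skipped
            row_repeat = int(row.get(TABLE + 'number-rows-repeated', '1'))
            rows.extend([values] * row_repeat)
    return rows


def ods_sheet_to_csv(ods_file, sheet_name, out):
    """Write the sheet to the open text file `out`; return the row count."""
    rows = ods_sheet_rows(ods_file, sheet_name)
    csv.writer(out).writerows(rows)
    return len(rows)
```

It could be called as ods_sheet_to_csv('converted.ods', 'DPCache', open('dpcache.csv', 'w', newline='')), where both file names are placeholders.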

Have you tried xlrd? See also the tutorial available from the python-excel website.

It's this simple:

>>> import xlrd
>>> book = xlrd.open_workbook('pivot_table_demo.xls')
>>> sheet = book.sheet_by_name('Summary')
>>> for row_index in xrange(sheet.nrows):
...     print sheet.row_values(row_index)
...
[u'Sum of sales', u'qtr', '', '', '', '']
[u'person', 1.0, 2.0, 3.0, 4.0, u'Grand Total']
[u'dick', 100.0, 99.0, 95.0, 90.0, 384.0]
[u'harriet', 100.0, 110.0, 121.0, 133.1, 464.1]
[u'tom', 100.0, 101.0, 102.0, 103.0, 406.0]
[u'Grand Total', 300.0, 310.0, 318.0, 326.1, 1254.1]
>>>

I used to have the same issue. You can solve it by unzipping the xlsx file and reading the XML files inside it. The two most important files are these:

  • xl/pivotCache/pivotCacheDefinition1.xml
  • xl/pivotCache/pivotCacheRecords1.xml

The first one defines how to interpret the raw data in pivotCacheRecords1.xml. Values there are looked up by index: for every column in pivotCacheRecords1.xml that holds an <x> tag, the v attribute is an index into the shared items of the same column in pivotCacheDefinition1.xml. To understand this better, look at the XML files.
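
The unzip step does not need an external tool. Here is a minimal sketch using Python's standard zipfile module; the helper name pivot_cache_members is made up for illustration, and it simply assumes the pivot cache parts live under xl/pivotCache/ as in the list above.

```python
import zipfile


def pivot_cache_members(xlsx_file):
    """Return the sorted names of the pivot cache XML parts in a workbook."""
    with zipfile.ZipFile(xlsx_file) as zf:
        return sorted(
            name for name in zf.namelist()
            # the .rels files under xl/pivotCache/_rels/ are not wanted here
            if name.startswith('xl/pivotCache/') and name.endswith('.xml')
        )
```

Each name can then be extracted with ZipFile.extract(), or opened in place with ZipFile.open() and handed to the XML parser as a file object.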

pivotCacheDefinition1.xml

<?xml version="1.0" encoding="UTF-8"?>
<pivotCacheDefinition xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" r:id="rId1" refreshedBy="ADNLatam" refreshedDate="42972.64919178241" createdVersion="5" refreshedVersion="6" recordCount="1923161">
   <cacheSource type="external" connectionId="1" />
   <cacheFields count="26">
      <cacheField name="C - Cadenas" numFmtId="0" sqlType="-9">
         <sharedItems count="3">
            <s v="superA" />
            <s v="superB" />
            <s v="superC" u="1" />
         </sharedItems>
      </cacheField>
      <cacheField name="C - Locales" numFmtId="0" sqlType="-9">
         <sharedItems count="80">
            <s v="Itaugua" />
            <s v="Denis Roa" />
            <s v="Total" />
            <s v="Los Laureles" />
            <s v="CDE" />
            <s v="S6 Fdo." />
            <s v="Central" u="1" />
            <s v="Unicompra" u="1" />
            <s v="San Lorenzo Centro" u="1" />
         </sharedItems>
      </cacheField>
   </cacheFields>
</pivotCacheDefinition>

pivotCacheRecords1.xml

<pivotCacheRecords
xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" count="246209">
<r>
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <s v="PAÑAL &quot;PAMPERS&quot; BABYSAN REGULAR GDE 9UN"/> <!-- Z - Sku / Descripcion -->
    <s v="07501006720341"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <n v="1"/> 
    <n v="11990"/> 
    <n v="2.3199999999999998"/> 
    <n v="10900"/> 
    <n v="11990"/> 
    <n v="1"/> 
    <d v="2012-02-03T00:00:00"/> 
    <x v="0"/> 
    <x v="0"/> 
    <n v="3"/> 
    <n v="6"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
    <x v="0"/> 
</r>

Note that the v attribute of each <x> tag in pivotCacheRecords1.xml is an index into the <s> tags of the matching cacheField in pivotCacheDefinition1.xml. Once you understand this, it is not so difficult to build a dict to use while iterating over the records.

      import xml.etree.ElementTree

      NS = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'
      definitions = '/tmp/scantrack_tmp/xl/pivotCache/pivotCacheDefinition1.xml'
      defdict = {}   # column index -> list of shared values
      columnas = []  # column names, in file order
      e = xml.etree.ElementTree.parse(definitions).getroot()
      for fields in e.findall(NS + 'cacheFields'):
          # getchildren() was removed in Python 3.9; iterate the element directly
          for cidx, field in enumerate(fields):
              columnas.append(field.attrib.get('name'))
              shared = field.find(NS + 'sharedItems')
              # a field without <sharedItems> gets an empty lookup list
              defdict[cidx] = [] if shared is None else [
                  value.attrib.get('v', 0) for value in shared]

We end up with this dict:

{
  0: ['supera', 'superb', u'superc'],
  1: ['Terminal',
     'CDE',
     'Brasilia',
     ],
  3: ['PANTENE', 'DOVE'],
  ...
}

Then all you have to do is iterate over pivotCacheRecords1.xml and, whenever the tag is <x>, match the index of the column with the key in defdict.

  import logging
  import xml.etree.ElementTree

  NS = '{http://schemas.openxmlformats.org/spreadsheetml/2006/main}'
  dfdata = []
  bdata = '/tmp/scantrack_tmp/xl/pivotCache/pivotCacheRecords1.xml'

  # Handle each <r> on its 'end' event: on 'start' its children are not
  # guaranteed to have been parsed yet.
  for event, elem in xml.etree.ElementTree.iterparse(bdata, events=('end',)):
      if elem.tag != NS + 'r':
          continue
      tmpdata = []
      for cidx, valueobj in enumerate(elem):
          rdata = valueobj.attrib.get('v')
          if valueobj.tag == NS + 'x':
              # <x v="N"/> is an index into the shared items of column cidx
              try:
                  rdata = defdict[cidx][int(rdata)]
              except (KeyError, IndexError, TypeError, ValueError):
                  logging.error(
                      'this should not happen: cidx=%s v=%s tmpdata=%s xml=%s',
                      cidx, rdata, tmpdata,
                      xml.etree.ElementTree.tostring(elem, encoding='unicode'))
          tmpdata.append(rdata)
      if tmpdata:
          dfdata.append(tmpdata)
      elem.clear()  # free the record once it has been read

Then you can put dfdata in a DataFrame (with pandas imported as pd):

df = pd.DataFrame(dfdata)
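
If the goal is a CSV file for a database load rather than a DataFrame, the standard csv module may be enough. The sketch below assumes columnas and dfdata were built as above; the helper name rows_to_csv is made up for illustration.

```python
import csv


def rows_to_csv(out, columnas, dfdata):
    """Write a header row plus the data rows to the open text file `out`."""
    writer = csv.writer(out)
    writer.writerow(columnas)
    # csv.writer writes None as an empty field
    writer.writerows(dfdata)
    return len(dfdata)
```

Open the output file with newline='' so the csv module controls the line endings, for example rows_to_csv(open('raw.csv', 'w', newline=''), columnas, dfdata), where raw.csv is a placeholder name.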

The rest is history. I hope this helps.

Happy Coding!!!
