
Python: Simulating CSV.DictReader with OpenPyXL

I have an Excel (.xlsx) file that I'm trying to parse, row by row. I have a header (first row) that has a bunch of column titles like School, First Name, Last Name, Email, etc.

When I loop through each row, I want to be able to say something like:

row['School']

and get back the value of the cell in the current row and the column with 'School' as its title.

I've looked through the OpenPyXL docs but can't seem to find anything terribly helpful.

Any suggestions?

I'm not incredibly familiar with OpenPyXL, but as far as I can tell it doesn't have any kind of dict reader/iterator helper. However, it's fairly easy to iterate over the worksheet rows, as well as to create a dict from two lists of values.

def iter_worksheet(worksheet):
    # It's necessary to get a reference to the generator, as 
    # `worksheet.rows` returns a new iterator on each access.
    rows = worksheet.rows

    # Get the header values as keys and move the iterator to the next item
    keys = [c.value for c in next(rows)]
    for row in rows:
        values = [c.value for c in row]
        yield dict(zip(keys, values))
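One caveat with a generator like the one above: if the worksheet yields no rows at all, `next(rows)` raises StopIteration inside the generator, which Python 3.7+ reports as a RuntimeError. Below is a minimal sketch of a guarded variant. The name `dict_rows` is hypothetical, and the function takes any iterable of value tuples, on the assumption that you would pass it something like `worksheet.iter_rows(values_only=True)`. Plain tuples stand in for worksheet rows so the sketch runs without a workbook.

```python
def dict_rows(rows):
    """Yield one dict per data row, using the first row as the keys."""
    rows = iter(rows)
    header = next(rows, None)  # None when there are no rows at all
    if header is None:
        return  # empty input: yield nothing instead of raising
    for row in rows:
        yield dict(zip(header, row))


# Made-up sample data standing in for worksheet rows.
sample = [
    ("School", "First Name", "Last Name"),
    ("MIT", "Ada", "Lovelace"),
    ("Yale", "Alan", "Turing"),
]
records = list(dict_rows(sample))
print(records[0]["School"])
```

With a real workbook the call would look like `dict_rows(ws.iter_rows(values_only=True))`, which hands the function cell values rather than Cell objects.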

Excel sheets are far more flexible than CSV files, so it makes little sense to have something like DictReader.

Just create an auxiliary dictionary from the relevant column titles.

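If you have columns like "School", "First Name", "Last Name", "EMail", you can build a dictionary that maps each title to its column index:

# Read the titles from the first row of the worksheet
values = [cell.value for cell in ws[1]]
keys = dict((value, idx) for (idx, value) in enumerate(values))
# ws.rows is a generator and cannot be sliced, so skip the header with min_row
for row in ws.iter_rows(min_row=2):
    school = row[keys['School']].value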
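The index-lookup idea can be tried without a workbook. This is a minimal sketch that uses plain tuples in place of worksheet rows. `column_index` is a hypothetical helper, and skipping untitled header cells is an assumption about what you want, not something openpyxl does for you.

```python
def column_index(header):
    """Map each non-empty column title to its 0-based position."""
    return {title: idx for idx, title in enumerate(header) if title is not None}


# Made-up sample data; None models a column with no title.
header = ("School", None, "EMail")
rows = [
    ("MIT", "x", "ada@example.com"),
    ("Yale", "y", "alan@example.com"),
]

keys = column_index(header)
emails = [row[keys["EMail"]] for row in rows]
print(emails)
```

Building the mapping once and indexing each row avoids creating a new dict per row, which may matter on large sheets.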

I wrote a DictReader based on openpyxl. Save the second listing as 'excel.py' and use it the same way as csv.DictReader. The first listing shows a usage example.

from excel import DictReader

with open('example01.xlsx', 'rb') as source_data:
    for row in DictReader(source_data, sheet_index=0):
        print(row)

excel.py:

__all__ = ['DictReader']

from openpyxl import load_workbook
from openpyxl.cell import Cell

# Change the default Cell value from None to '' (empty string),
# the same way as in csv.DictReader.
Cell.__init__.__defaults__ = (None, None, '', None)


class DictReader(object):
    def __init__(self, f, sheet_index,
                 fieldnames=None, restkey=None, restval=None):
        self._fieldnames = fieldnames   # list of keys for the dict
        self.restkey  = restkey         # key to catch long rows
        self.restval  = restval         # default value for short rows
        self.reader   = load_workbook(f, data_only=True).worksheets[sheet_index].iter_rows(values_only=True)
        self.line_num = 0

    def __iter__(self):
        return self

    @property
    def fieldnames(self):
        if self._fieldnames is None:
            try:
                self._fieldnames = next(self.reader)
                self.line_num += 1
            except StopIteration:
                pass

        return self._fieldnames

    @fieldnames.setter
    def fieldnames(self, value):
        self._fieldnames = value

    def __next__(self):
        if self.line_num == 0:
            # Used only for its side effect.
            self.fieldnames

        row = next(self.reader)
        self.line_num += 1

        # unlike the basic reader, we prefer not to return blanks,
        # because we will typically wind up with a dict full of None
        # values
        while row == ():
            row = next(self.reader)

        d = dict(zip(self.fieldnames, row))
        lf = len(self.fieldnames)
        lr = len(row)

        if lf < lr:
            d[self.restkey] = row[lf:]
        elif lf > lr:
            for key in self.fieldnames[lr:]:
                d[key] = self.restval

        return d
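The long-row and short-row handling at the end of `__next__` mirrors the `restkey` and `restval` behaviour of csv.DictReader. A minimal standalone sketch of that step follows. `row_to_dict` is a hypothetical helper name, and the rows are made-up tuples rather than values read from a workbook.

```python
def row_to_dict(fieldnames, row, restkey=None, restval=None):
    """Pair field names with row values, handling length mismatches."""
    d = dict(zip(fieldnames, row))
    lf, lr = len(fieldnames), len(row)
    if lf < lr:
        # Row is longer than the header: surplus values go under restkey.
        d[restkey] = list(row[lf:])
    elif lf > lr:
        # Row is shorter than the header: missing fields get restval.
        for key in fieldnames[lr:]:
            d[key] = restval
    return d


long_row = row_to_dict(["a", "b"], (1, 2, 3, 4), restkey="extra")
short_row = row_to_dict(["a", "b", "c"], (1,), restval="")
print(long_row)
print(short_row)
```

If `restkey` is left at its default, the surplus values end up under the key `None`, which is easy to overlook when iterating over the dict.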

The following seems to work for me.

    header = True
    headings = []
    for row in ws.rows:
        if header:
            for cell in row:
                headings.append(cell.value)
            header = False
            continue
        rowData = dict(zip(headings, row))
        wantedValue = rowData['myHeading'].value

I ran into the same issue as described above, so I created a simple extension called openpyxl-dictreader that can be installed through pip. It is very similar to the suggestion made by @viktor earlier in this thread.

It allows you to select items based on column names using openpyxl. For example:

import openpyxl_dictreader

reader = openpyxl_dictreader.DictReader("names.xlsx", "Sheet1")
for row in reader:
    print(row["First Name"], row["Last Name"])
