openpyxl: a better way to read a range of numbers to an array

Question

I am looking for a better (more readable / less hacked together) way of reading a range of cells using openpyxl . What I have at the moment works, but involves composing the excel cell range (eg A1:C3 ) by assembling bits of the string, which feels a bit rough.

At the moment this is how I read nCols columns and nRows rows starting from a particular cell (minimum working example, assumes that worksheet.xlsx is in working directory, and has the cell references written in cells A1 to C3 in Sheet1 :

from openpyxl import load_workbook
import numpy as np

firstCol = "B"
firstRow = 2

nCols = 2
nRows = 2

lastCol = chr(ord(firstCol) + nCols - 1) 

cellRange = firstCol + str(firstRow) + ":" + lastCol + str(firstRow + nRows - 1)

wsName = "Sheet1"
wb = load_workbook(filename="worksheet.xlsx", data_only=True)
data = np.array([[i.value for i in j] for j in wb[wsName][cellRange]])
print(data)

Returns:

[[u'B2' u'C2']
 [u'B3' u'C3']]

As well as being a bit ugly there are functional limitations with this approach. For example in sheets with more than 26 columns it will fail for columns like AA .

Is there a better/correct way to read nRows and nCols from a given top-left corner using openpyxl?

Answer 1

openpyxl provides functions for converting between numerical column indices (1-based index) and Excel's 'AA' style. See the utils module for details.

However, you'll have little need for them in general. You can use the get_squared_range() method of worksheets for programmatic access. And, starting with openpyxl 2.4, you can do the same with the iter_rows() and iter_cols() methods. NB. iter_cols() is not available in read-only mode.

The equivalent MWE using iter_rows() would be:

from openpyxl import load_workbook
import numpy as np
wsName = "Sheet1"
wb = load_workbook(filename="worksheet.xlsx", data_only=True)
ws = wb[wsName]

firstRow = 2
firstCol = 2
nCols = 2
nRows = 2

allCells = np.array([[cell.value for cell in row] for row in ws.iter_rows()])

# allCells is zero-indexed
data = allCells[(firstRow-1):(firstRow-1+nRows),(firstCol-1):(firstCol-1+nCols)]
print(data)

The equivalent MWE using get_squared_range() would be:

from openpyxl import load_workbook
import numpy as np

wsName = "Sheet1"
wb = load_workbook(filename="worksheet.xlsx", data_only=True)

firstCol = 2
firstRow = 2
nCols = 2
nRows = 2

data = np.array([[i.value for i in j] for j in wb[wsName].get_squared_range(
            firstCol, firstRow, firstCol+nCols-1, firstRow+nRows-1)])
print(data)

Both of which return:

[[u'B2' u'C2']
 [u'B3' u'C3']]

See also https://openpyxl.readthedocs.io/en/default/pandas.html for more information on using Pandas and openpyxl together.

Answer 2

For completeness (and so I can find it later) the equivalent code using the pandas function read_excel suggested by @Rob in a comment would be:

import pandas
import numpy as np
wsName = "Sheet1"
df = pandas.read_excel(open("worksheet.xlsx", "rb"), sheetname=wsName, header=None)

firstRow = 2
firstCol = 2
nCols = 2
nRows = 2

# Data-frame is zero-indexed

data = np.array(df.ix[(firstRow-1):(firstRow-2+nRows), (firstRow-1):(firstRow-2+nRows)])
print(data)

Which returns:

[[u'B2' u'C2']
 [u'B3' u'C3']]

openpyxl: a better way to read a range of numbers to an array

Question

2 answers

solution1
1 ACCPTED 2016-10-11 06:57:37

solution2
0 2016-10-11 08:01:31

openpyxl: a better way to read a range of numbers to an array

Question

2 answers

solution1 1 ACCPTED 2016-10-11 06:57:37

solution2 0 2016-10-11 08:01:31

solution1
1 ACCPTED 2016-10-11 06:57:37

solution2
0 2016-10-11 08:01:31