
openpyxl Python Iterating Through Large Data List

I have a large Excel workbook with one sheet of roughly 45,000 rows and 45 columns. I want to iterate through the columns looking for duplicates and unique items, and it's taking a very long time to go through individual columns. Is there any way to optimize my code or make it go faster? I want to either print the information or save it to a txt file. I'm on Windows 10 and Python 2.7, using the openpyxl module:

    from openpyxl import load_workbook, worksheet, Workbook
    import os

    #read work book to get data
    wb = load_workbook(filename = 'file.xlsx', use_iterators = True)
    ws = wb.get_sheet_by_name(name = 'file') 
    wb = load_workbook(filename='file.xlsx', read_only=True)

    count = 0
    seen = set()
    uniq = []

    for cell in ws.columns[0]:
       if cell not in seen:
         uniq.append(cell)
         seen.add(cell)

    print("Unique: "+uniq)
    print("Doubles: "+seen)

EDIT: Let's say I have 5 columns, A, B, C, D, E, and 10 entries, so 10 rows (5x10). In column A I want to extract all the duplicates and separate them from the unique values.

As VedangMehta mentioned, Pandas will do it very quickly for you.

Run this code:

import pandas as pd
#read in the dataset:
df = pd.read_excel('file.xlsx', sheetname = 'file')

df_dup = df.groupby(axis=1, level=0).apply(lambda x: x.duplicated())

#save duplicated values from first column
df[df_dup].iloc[:,0].to_csv("file_duplicates_col1.csv")

#save unique values from first column
df[~df_dup].iloc[:,0].to_csv("file_unique_col1.csv")

#save duplicated values from all columns:
df[df_dup].to_csv("file_duplicates.csv")

#save unique values from all columns:
df[df_dup].to_csv("file_unique.csv")

For details, see below:

Suppose your dataset looks as follows:

df = pd.DataFrame({'a':[1,3,1,13], 'b':[13,3,5,3]})
df.head()
Out[24]:
    a   b
0   1  13
1   3   3
2   1   5
3  13   3

You can find which values are duplicated in each column:

df_dup = df.groupby(axis=1, level=0).apply(lambda x: x.duplicated())

the result:

df_dup

Out[26]:
       a      b
0  False  False
1  False  False
2   True  False
3  False   True

You can find the duplicated values by subsetting df using the boolean DataFrame df_dup:

df[df_dup]
Out[27]:
     a    b
0  NaN  NaN
1  NaN  NaN
2  1.0  NaN
3  NaN  3.0

Again, you can save that using:

 #save the above using:
 df[df_dup].to_csv("duplicated_values.csv")

To see the duplicated values in the first column, use:

df[df_dup].iloc[:,0]

to get

Out[11]:
0    NaN
1    NaN
2    1.0
3    NaN
Name: a, dtype: float64

For unique values, use ~, Python's inversion operator, which pandas applies element-wise to the boolean DataFrame. So you're essentially subsetting df by the values that are not duplicates:

df[~df_dup]

Out[29]:
      a     b
0   1.0  13.0
1   3.0   3.0
2   NaN   5.0
3  13.0   NaN
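
If you want the duplicate and unique values themselves rather than the NaN-masked frame, you can drop the NaN placeholders. A minimal sketch (it rebuilds the same mask with a column-wise apply, which gives the same result as the groupby call above):

    import pandas as pd

    df = pd.DataFrame({'a': [1, 3, 1, 13], 'b': [13, 3, 5, 3]})

    # same boolean mask as df_dup above, built column by column
    df_dup = df.apply(lambda col: col.duplicated())

    # drop the NaN placeholders to get the values themselves
    dup_a = df[df_dup]['a'].dropna()      # duplicated values in column a -> [1.0]
    uniq_a = df[~df_dup]['a'].dropna()    # first occurrences in column a -> [1.0, 3.0, 13.0]

    print(dup_a.tolist())
    print(uniq_a.tolist())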

When working in read-only mode, don't use the columns property to read a worksheet. This is because the data is stored in rows, so reading columns requires the parser to continually re-read the file.
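
If you want to stay in openpyxl rather than Pandas, here is a hedged sketch of the row-based approach for a single column, assuming the same 'file.xlsx' workbook and 'file' sheet names as in the question (iter_rows streams the file once instead of re-parsing it per column):

    from openpyxl import load_workbook

    wb = load_workbook(filename='file.xlsx', read_only=True)
    ws = wb['file']

    seen = set()
    uniq = []     # first occurrence of every value in column A
    dups = set()  # values that appear more than once

    # iter_rows streams the sheet row by row; restrict it to the first column
    for row in ws.iter_rows(min_col=1, max_col=1):
        value = row[0].value
        if value in seen:
            dups.add(value)
        else:
            seen.add(value)
            uniq.append(value)

    print("Unique: %s" % uniq)
    print("Doubles: %s" % dups)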

There is an example of using openpyxl to convert worksheets into Pandas dataframes. It requires openpyxl 2.4 or higher, which at the time of writing must be checked out from the repository.
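
A minimal sketch of that conversion, assuming the same file and sheet names as above and that the first row holds the column headers (Worksheet.values yields one tuple of cell values per row):

    import pandas as pd
    from openpyxl import load_workbook

    wb = load_workbook(filename='file.xlsx', read_only=True)
    ws = wb['file']

    rows = ws.values            # generator of row-value tuples
    header = next(rows)         # first row as column names
    df = pd.DataFrame(list(rows), columns=header)

    # from here the Pandas approach above applies, e.g. per-column duplicate flags
    df_dup = df.apply(lambda col: col.duplicated())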
