
Is there a faster way to read, write, and save Excel files?

I am new to Python. I need to count the number of duplicates of each value, delete the duplicate rows, and write the duplicate count into a new column. Below is my code:

import pandas as pd
from openpyxl import load_workbook


filepath = '/Users/jordanliu/Desktop/test/testA.xlsx'
data = load_workbook(filepath)


sku = data.active

duplicate_column = []
for x in range(sku.max_row):

    duplicate_count = 0

    # compare this row's value in column 1 against every earlier row
    for i in range(x):
        if sku.cell(row=i + 2, column=1).value == sku.cell(row=x + 2, column=1).value:
            duplicate_count = duplicate_column[i] + 1
            sku.cell(row=i + 2, column=1).value = 0   # mark the earlier copy for deletion

    duplicate_column.append(duplicate_count)


# write the duplicate counts into column 3
for x in range(len(duplicate_column)):
    sku.cell(row=x + 2, column=3).value = duplicate_column[x]

# delete the rows that were marked with 0
for y in range(sku.max_row):
    y = y + 1
    if sku.cell(row=y, column=1).value == 0:
        sku.delete_rows(y, 1)


data.save(filepath) 

I tried pandas first, but because the execution time was extraordinarily long, I switched to openpyxl; it doesn't seem to help much either. Many other posts suggest using CSV, but since the writing process takes the majority of the time, I don't think that would help much. Can someone please help me here?

for x in range(sku.max_row):

    duplicate_count = 0

    for i in range(x):
        if sku.cell(row=i + 2, column=1).value == sku.cell(row=x + 2, column=1).value:
            duplicate_count = duplicate_column[i] + 1
            sku.cell(row=i + 2, column=1).value = 0

In this portion you recheck the same values over and over. Assuming the values should end up unique, which is how I read your code, you should instead keep a cache in a hashed type (dict or set) and do the subsequent lookups there instead of calling sku.cell every time.

So it would be something like:

xl_cache = {}          # cell value -> row number of its first occurrence
duplicate_count = {}   # row number of a first occurrence -> number of later duplicates
delete_set = set()     # row numbers of the later duplicates, to delete afterwards

for x in range(2, sku.max_row + 1):            # row 1 is the header
    x_val = sku.cell(row=x, column=1).value
    if x_val in xl_cache:                      # not the first time we see this value
        duplicate_count[xl_cache[x_val]] += 1  # increase the first occurrence's duplicate count
        delete_set.add(x)                      # mark this duplicate row for deletion
    else:
        xl_cache[x_val] = x                    # key is the cell value, value is its row number
        duplicate_count[x] = 0                 # key is the row number, value is its duplicate count

Now you have a dictionary of originals with their duplicate counts, so you need to go back, delete the rows you don't want, and write the duplicate counts into the sheet. Go backwards through the range, starting at the maximum row and decreasing by 1: check for deletion first, otherwise write the duplicate count.

y = sku.max_row
for i in range(y, 1, -1):                      # stop before row 1, the header
    if i in delete_set:
        sku.delete_rows(i, 1)
    else:
        sku.cell(row=i, column=3).value = duplicate_count[i]

In theory this traverses your range only twice in total, and lookups from the cache are O(1) on average. The second pass has to run in reverse, because deleting a row shifts every row below it up; going backwards keeps the stored row numbers valid for the rows you haven't visited yet.
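Putting the two passes together with the load and save from your original script, the whole thing would look roughly like this (a sketch, untested, assuming the SKU values are in column A with a header in row 1 and the count going into column C, as in your code):

from openpyxl import load_workbook

filepath = '/Users/jordanliu/Desktop/test/testA.xlsx'
data = load_workbook(filepath)
sku = data.active

first_row = {}        # cell value -> row number of its first occurrence
duplicate_count = {}  # row number of a first occurrence -> number of later duplicates
delete_set = set()    # row numbers of the later duplicates

# pass 1: count duplicates and remember which rows to delete
for row in range(2, sku.max_row + 1):
    value = sku.cell(row=row, column=1).value
    if value in first_row:
        duplicate_count[first_row[value]] += 1
        delete_set.add(row)
    else:
        first_row[value] = row
        duplicate_count[row] = 0

# pass 2: delete duplicates and write counts, in reverse so row numbers stay valid
for row in range(sku.max_row, 1, -1):
    if row in delete_set:
        sku.delete_rows(row, 1)
    else:
        sku.cell(row=row, column=3).value = duplicate_count[row]

data.save(filepath)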

Since I don't have your sample data I can't test this code completely, so there could be minor issues, but I tried to use the structures you already have in your code so it's easy for you to adopt.
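As an aside, since you mentioned pandas: the same operation can also be written in a few lines there. This is only a rough sketch, assuming the values to deduplicate are in the first column under a header row; the duplicate_count column name is just illustrative:

import pandas as pd

filepath = '/Users/jordanliu/Desktop/test/testA.xlsx'
df = pd.read_excel(filepath)

sku_col = df.columns[0]                        # first column holds the values to deduplicate
# number of copies beyond the first, aligned to every row
df["duplicate_count"] = df.groupby(sku_col)[sku_col].transform("size") - 1
df = df.drop_duplicates(subset=sku_col, keep="first")   # keep one row per value

df.to_excel(filepath, index=False)

Whether this ends up faster than openpyxl for your file size is something you would have to measure, but it avoids the cell-by-cell writes entirely.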
