Making a Python Program faster

Question

Last night I was writing Python code that compares one Excel file (4 columns, 30 thousand rows) to another (4 columns, 30 thousand rows) and generates a third Excel file with the differences found. The structure is very similar to the code below: first it finds a specific product in the other file, and then it compares their attributes. It works really well; the problem is that it takes 80 hours to run, and I need it to run in no more than 2. The example below is a simplified version of the real code, which deals with XLSX files with 30 thousand rows and more than 90 columns. How can I make it quicker?

#imports

from datetime import date

import pandas as pd

import xlsxwriter

#reading XLS's

tabelac = pd.read_excel('TESTE.xlsx', 'Dados')
tabelae = pd.read_excel('TESTE2.xlsx', 'Dados')

#Excluding Nan Values

print(len(tabelac))

for i in range(len(tabelac)):

  produto = tabelac.loc[i,'Produto']
  if pd.isna(produto):
    tabelac.loc[i,'Produto'] = ''

  preco = tabelac.loc[i,'Preço']
  if pd.isna(preco):
    tabelac.loc[i,'Preço'] = ''

  tipo = tabelac.loc[i,'Tipo']
  if pd.isna(tipo):
    tabelac.loc[i,'Tipo'] = ''
    
  Q_vendas = tabelac.loc[i,'Q_vendas']
  if pd.isna(Q_vendas):
    tabelac.loc[i,'Q_vendas'] = ''

for i in range(len(tabelae)):

  produto = tabelae.loc[i,'Produto']
  if pd.isna(produto):
    tabelae.loc[i,'Produto'] = ''

  preco = tabelae.loc[i,'Preço']
  if pd.isna(preco):
    tabelae.loc[i,'Preço'] = ''

  tipo = tabelae.loc[i,'Tipo']
  if pd.isna(tipo):
    tabelae.loc[i,'Tipo'] = ''
    
  Q_vendas = tabelae.loc[i,'Q_vendas']
  if pd.isna(Q_vendas):
    tabelae.loc[i,'Q_vendas'] = ''

#printing XLS's

print(tabelac)
print()
print('--------------------------')
print()
print(tabelae)

#declaring error list

erros = []

#evaluating errors

for i in range(len(tabelac)):
  
  for e in range(len(tabelae)):
    
    if tabelac.loc[i,'Produto'] == tabelae.loc[e,'Produto']:
      
      if tabelac.loc[i,'Preço'] != tabelae.loc[e,'Preço']:
        erros.append(f'Price mismatch for product {tabelac.loc[i,"Produto"]}')
        
      if tabelac.loc[i,'Tipo'] != tabelae.loc[e,'Tipo']:
        erros.append(f'Type mismatch for product {tabelac.loc[i,"Produto"]}')
        
      if tabelac.loc[i,'Q_vendas'] != tabelae.loc[e,'Q_vendas']:
        erros.append(f'Sales quantity mismatch for product {tabelac.loc[i,"Produto"]}')

#evaluating missing products

for i in range(len(tabelac)):
  for e in range(len(tabelae)):
    if tabelac.loc[i,'Produto'] == tabelae.loc[e,'Produto']:
      break
    if tabelac.loc[i,'Produto'] != tabelae.loc[e,'Produto'] and (e+1) == len(tabelae):
      erros.append(f'The product {tabelac.loc[i,"Produto"]} was not found in both tables')

for i in range(len(tabelae)):
  for e in range(len(tabelac)):
    if tabelae.loc[i,'Produto'] == tabelac.loc[e,'Produto']:
      break
    if tabelae.loc[i,'Produto'] != tabelac.loc[e,'Produto'] and (e+1) == len(tabelac):
      erros.append(f'The product {tabelae.loc[i,"Produto"]} was not found in both tables')
      
#printing results to compare with XLSX

for i in range(len(erros)):
  print()
  print(erros[i])

#generating the XLS's to exhibit values obtained
 
workbook = xlsxwriter.Workbook('Results.xlsx')
worksheet = workbook.add_worksheet()
row = 3
column = 0

data_atual = date.today()

data_em_texto = '{}/{}/{}'.format(data_atual.day, data_atual.month,data_atual.year)

worksheet.write(0, 0, f'XLS generated in {data_em_texto}')

worksheet.write(1, 0, '')

worksheet.write(2, 0, 'Errors Found')


for item in erros:
 
    # write operation perform
  
    worksheet.write(row, column, item)
 
    # incrementing the value of row by one 
  
    row += 1
     
workbook.close()

If I were to compare the first XLSX against 4 lists (Products, Type, Price, Sales Amount) instead, is there a better way to perform this task?

Answer 1

4 columns x 30k rows is a small file.

As pointed out many times on Stack Overflow, do not loop over rows: pandas is designed to operate on whole columns at once. In your case you not only iterate over rows, you do it in nested for loops, which adds up to O(N²) time.

For instance, .fillna is orders of magnitude faster than a for loop with isna.
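A minimal sketch of that replacement, using a small made-up DataFrame in place of the real spreadsheet (the sample values are assumptions, the column names come from the question). The cast to object is only there so that the empty string used in the question can go into numeric columns as well:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for pd.read_excel('TESTE.xlsx', 'Dados')
tabelac = pd.DataFrame({
    'Produto': ['A', None, 'C'],
    'Preço': [10.0, np.nan, 30.0],
    'Tipo': ['x', 'y', None],
    'Q_vendas': [1.0, 2.0, np.nan],
})

# One vectorised call replaces the whole row-by-row isna loop.
# astype(object) lets numeric columns hold '' like the original code does.
tabelac = tabelac.astype(object).fillna('')

print(tabelac)
```

If the empty string is not actually needed, leaving the missing values as NaN keeps the numeric columns numeric, which is usually easier to compare later.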

You have multiple ways to proceed; merge with on='Produto' is likely a good candidate:

how='outer' will create a table of roughly 30k rows (more if some products exist in only one file) in which every non-key column appears twice: with an _x suffix for the first file and a _y suffix for the second one.

From there, you can check whether the _x and _y columns have data and work out whether a product is in A, in B, or in both.
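One way to get the "in A, in B, or in both" information without inspecting the _x and _y columns by hand is the indicator parameter of merge, which adds a _merge column to the result. The sketch below uses two tiny made-up tables (the data is an assumption, only the column names follow the question):

```python
import pandas as pd

# Hypothetical sample data standing in for the two spreadsheets
tabelac = pd.DataFrame({'Produto': ['A', 'B', 'C'], 'Preço': [10, 20, 30]})
tabelae = pd.DataFrame({'Produto': ['B', 'C', 'D'], 'Preço': [20, 35, 40]})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only' or 'both' for every row
merged = tabelac.merge(tabelae, on='Produto', how='outer', indicator=True)

only_in_first = merged.loc[merged['_merge'] == 'left_only', 'Produto'].tolist()
only_in_second = merged.loc[merged['_merge'] == 'right_only', 'Produto'].tolist()
in_both = merged[merged['_merge'] == 'both']

print('Only in first file:', only_in_first)
print('Only in second file:', only_in_second)
```

The two lists replace the nested "missing products" loops from the question, and in_both is the subset on which the attribute comparison makes sense.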

Edit: proof of concept example (not tested)

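import pandas as pd

tabelac = pd.read_excel('TESTE.xlsx', 'Dados')
tabelae = pd.read_excel('TESTE2.xlsx', 'Dados')

# Outer merge keeps products that appear in only one of the files
merged = tabelac.merge(tabelae, how='outer', on='Produto')

# Boolean masks, one per attribute. Note: NaN != NaN is True, so rows
# missing from one file (or empty in both) are flagged as well.
wrong_price = merged['Preço_x'] != merged['Preço_y']
wrong_type = merged['Tipo_x'] != merged['Tipo_y']

# (...and so on)

# The newline makes the DataFrame print below the label
print(f"Products with wrong price:\n{merged[wrong_price]}")

# To dump the full table at once:
merged[wrong_price].to_excel('items_with_wrong_price.xlsx')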
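Since the real files have more than 90 columns, writing one mask per column by hand gets tedious. The following is a rough sketch of how the same idea could be generalised; find_divergences is a hypothetical helper, the sample data is made up, and the message wording is an assumption. It loops over columns (about 90 iterations), not over rows, and treats two missing values as equal:

```python
import numpy as np
import pandas as pd

def find_divergences(left, right, key='Produto'):
    """Return one message per attribute that differs between matching rows."""
    # Inner merge: only products present in both tables are compared
    merged = left.merge(right, on=key, how='inner', suffixes=('_x', '_y'))
    messages = []
    for col in left.columns:
        if col == key or col not in right.columns:
            continue  # nothing to compare for this column
        a, b = merged[f'{col}_x'], merged[f'{col}_y']
        # Plain != would flag NaN vs NaN, so mask out "both missing"
        differs = (a != b) & ~(a.isna() & b.isna())
        messages += [f'Divergence in {col} for product {p}'
                     for p in merged.loc[differs, key]]
    return messages

# Hypothetical sample data
left = pd.DataFrame({'Produto': ['A', 'B', 'C'],
                     'Preço': [10, 20, np.nan],
                     'Tipo': ['x', 'y', 'z']})
right = pd.DataFrame({'Produto': ['A', 'B', 'C'],
                      'Preço': [10, 25, np.nan],
                      'Tipo': ['x', 'y', 'w']})

messages = find_divergences(left, right)
print(messages)
```

The resulting list can be written out in one go with pd.DataFrame({'Errors Found': messages}).to_excel(...) instead of the cell-by-cell xlsxwriter loop, if the exact layout of the report is not important.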

solution1
0 2022-05-23 15:28:10