Making a Python program faster

Question

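Last night I was developing Python code that compares one Excel file (4 columns, 30,000 rows) with another Excel file (4 columns, 30,000 rows) and generates a third Excel file listing the differences found. The structure is very similar to the code later in this post: first find a specific product in the other Excel file, then compare their attributes. It works well. The problem is that this code takes 80 hours to run, and I need it to run in no more than 2. The example below is a simplified version of the real code; the real XLSX files have 30,000 rows and more than 90 columns. How can I make it faster?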

#imports

from datetime import date

import pandas as pd

import xlsxwriter

#reading XLS's

tabelac = pd.read_excel('TESTE.xlsx', 'Dados')
tabelae = pd.read_excel('TESTE2.xlsx', 'Dados')

#Excluding Nan Values

print(len(tabelac))

for i in range(len(tabelac)):

  produto = tabelac.loc[i,'Produto']
  if pd.isna(produto):
    tabelac.loc[i,'Produto'] = ''

  preco = tabelac.loc[i,'Preço']
  if pd.isna(preco):
    tabelac.loc[i,'Preço'] = ''

  tipo = tabelac.loc[i,'Tipo']
  if pd.isna(tipo):
    tabelac.loc[i,'Tipo'] = ''
    
  Q_vendas = tabelac.loc[i,'Q_vendas']
  if pd.isna(Q_vendas):
    tabelac.loc[i,'Q_vendas'] = ''

for i in range(len(tabelae)):

  produto = tabelae.loc[i,'Produto']
  if pd.isna(produto):
    tabelae.loc[i,'Produto'] = ''

  preco = tabelae.loc[i,'Preço']
  if pd.isna(preco):
    tabelae.loc[i,'Preço'] = ''

  tipo = tabelae.loc[i,'Tipo']
  if pd.isna(tipo):
    tabelae.loc[i,'Tipo'] = ''
    
  Q_vendas = tabelae.loc[i,'Q_vendas']
  if pd.isna(Q_vendas):
    tabelae.loc[i,'Q_vendas'] = ''

#printing XLS's

print(tabelac)
print()
print('--------------------------')
print()
print(tabelae)

#declaring error list

erros = []

#evaluating errors

for i in range(len(tabelac)):
  
  for e in range(len(tabelae)):
    
    if tabelac.loc[i,'Produto'] == tabelae.loc[e,'Produto']:
      
      if tabelac.loc[i,'Preço'] != tabelae.loc[e,'Preço']:
        erros.append(f'Divergência no Preço do produto {tabelac.loc[i,"Produto"]}')
        
      if tabelac.loc[i,'Tipo'] != tabelae.loc[e,'Tipo']:
        erros.append(f'Divergência Tipo do produto {tabelac.loc[i,"Produto"]}')
        
      if tabelac.loc[i,'Q_vendas'] != tabelae.loc[e,'Q_vendas']:
        erros.append(f'Divergência preço do produto {tabelac.loc[i,"Produto"]}')

#evaluating missing products

for i in range(len(tabelac)):
  for e in range(len(tabelae)):
    if tabelac.loc[i,'Produto'] == tabelae.loc[e,'Produto']:
      break
    if tabelac.loc[i,'Produto'] != tabelae.loc[e,'Produto'] and (e+1) == len(tabelae):
      erros.append(f'O produto {tabelac.loc[i,"Produto"]} não foi encontrado em ambas as tabelas')

for i in range(len(tabelae)):
  for e in range(len(tabelac)):
    if tabelae.loc[i,'Produto'] == tabelac.loc[e,'Produto']:
      break
    if tabelae.loc[i,'Produto'] != tabelac.loc[e,'Produto'] and (e+1) == len(tabelac):
      erros.append(f'O produto {tabelae.loc[i,"Produto"]} não foi encontrado em ambas as tabelas')
      
#printing results to compare with XLSX

for i in range(len(erros)):
  print()
  print(erros[i])

#generating the XLS's to exhibit values obtained
 
workbook = xlsxwriter.Workbook('Results.xlsx')
worksheet = workbook.add_worksheet()
row = 3
column = 0

data_atual = date.today()

data_em_texto = '{}/{}/{}'.format(data_atual.day, data_atual.month,data_atual.year)

worksheet.write(0, 0, f'XLS generated in {data_em_texto}')

worksheet.write(1, 0, '')

worksheet.write(2, 0, 'Errors Found')


for item in erros:
 
    # write operation perform
  
    worksheet.write(row, column, item)
 
    # incrementing the value of row by one 
  
    row += 1
     
workbook.close()

If I should instead compare the first XLSX against 4 lists (product, type, price, sales), is there a better way to perform this task?

Answer 1

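4 columns x 30k rows is a small file.

As has been pointed out many times on StackOverflow, don't loop over rows; pandas is built to operate on whole columns and tables at once. In your case you not only iterate over rows, you nest the iteration, so the work grows as N².

For example, .fillna is orders of magnitude faster than for + isna.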
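To illustrate the .fillna point, here is a minimal sketch that replaces the two cleanup loops from the question with one call. The small DataFrame is made-up data standing in for the 'Dados' sheet, and casting to object before filling is my own assumption, so that numeric columns can hold the empty string.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the 'Dados' sheet; the real table would come from pd.read_excel.
tabelac = pd.DataFrame({
    "Produto": ["A", np.nan, "C"],
    "Preço": [10.0, 20.0, np.nan],
    "Tipo": ["x", np.nan, "z"],
    "Q_vendas": [1.0, np.nan, 3.0],
})

# Cast to object so numeric columns can hold '', then fill every NaN in one call.
# This replaces the row-by-row isna checks on all four columns.
cleaned = tabelac.astype(object).fillna("")

print(cleaned)
```

The same line can be applied to the second table. Whether the values need to be replaced with '' at all depends on how the comparison is done afterwards.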

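There are several ways to proceed; merge with on='Produto' is probably a good choice:

how='outer' will create a single table of 30k or more rows that holds the columns of both files. Column names that appear in both files get the suffix _x for the first file and _y for the second.

From there you can check whether the _x and _y columns have data and work out whether a product is in A, in B, or in both.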
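One way to sketch the "in A, B, or both" check is the indicator parameter of merge, which adds a _merge column to the result. This is an alternative to inspecting the _x and _y columns by hand, not something the answer itself uses, and the two small tables are made-up data.

```python
import pandas as pd

# Made-up stand-ins for the two 'Dados' sheets.
tabelac = pd.DataFrame({"Produto": ["A", "B", "C"], "Preço": [10, 20, 30]})
tabelae = pd.DataFrame({"Produto": ["B", "C", "D"], "Preço": [20, 35, 40]})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only' or 'both' for every row.
merged = tabelac.merge(tabelae, how="outer", on="Produto", indicator=True)

only_first = merged.loc[merged["_merge"] == "left_only", "Produto"].tolist()
only_second = merged.loc[merged["_merge"] == "right_only", "Produto"].tolist()
in_both = merged.loc[merged["_merge"] == "both", "Produto"].tolist()

print("Only in first file:", only_first)
print("Only in second file:", only_second)
```

This covers both "missing product" loops from the question in a single pass over the merged table.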

Edit: proof-of-concept example (untested)

import pandas as pd

tabelac = pd.read_excel('TESTE.xlsx', 'Dados')
tabelae = pd.read_excel('TESTE2.xlsx', 'Dados')

# Rows of both files matched on Produto, with their columns side by side
merged = tabelac.merge(tabelae, how='outer', on='Produto')

# Boolean masks; a product missing from one file has NaN on that side,
# and NaN != value is True, so those rows are flagged as well
wrong_price = merged['Preço_x'] != merged['Preço_y']
wrong_type = merged['Tipo_x'] != merged['Tipo_y']

# (...and so on)

print(f"Products with wrong price: {merged[wrong_price]}")

# To dump the full table at once:
merged[wrong_price].to_excel('items_with_wrong_price.xlsx')
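Building on the proof of concept, the sketch below shows one possible way to collect a message per mismatch, as the question's erros list does. It assumes an inner merge so that only products present in both files are compared, and it treats two missing values as equal; the message wording and the sample data are made up for illustration.

```python
import pandas as pd

# Made-up stand-ins for the two 'Dados' sheets.
tabelac = pd.DataFrame({
    "Produto": ["A", "B", "C"],
    "Preço": [10.0, 20.0, None],
    "Tipo": ["x", "y", "z"],
})
tabelae = pd.DataFrame({
    "Produto": ["A", "B", "C"],
    "Preço": [10.0, 25.0, None],
    "Tipo": ["x", "y", "w"],
})

# Inner merge: keep only products that exist in both files.
both = tabelac.merge(tabelae, how="inner", on="Produto")

erros = []
# The loop runs over the attribute columns, not over the rows.
for col in ["Preço", "Tipo"]:
    left, right = both[f"{col}_x"], both[f"{col}_y"]
    # Flag rows whose values differ, but not rows where both sides are NaN
    # (NaN != NaN is True and would otherwise count as a difference).
    differs = (left != right) & ~(left.isna() & right.isna())
    erros.extend(
        f"Divergência em {col} do produto {p}"
        for p in both.loc[differs, "Produto"]
    )

# One-column table holding the messages, ready to be written out.
report = pd.DataFrame({"Errors Found": erros})
print(report)
```

Calling report.to_excel('Results.xlsx', index=False) would then write the messages in one step, in the same way the proof of concept dumps its table.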


Solution 1
0 2022-05-23 15:28:10
