
How to search and edit excel files fast in Openpyxl

I have two worksheets. I need to compare each cell in the sheet 'Data' (350k rows of strings) with the cells in another sheet, 'Dictionary'. If the string is not in 'Dictionary', or is in the first column of 'Dictionary', do nothing. If it appears elsewhere in 'Dictionary', take the value in the first column of that row and write it next to the original string in 'Data'.

As mentioned in the title, the problem is speed. This code works on a test file of around 150 rows, but takes 4 minutes to run, so it is infeasible for my full file. Please tell me how I can speed it up. This is my first Python code.

import openpyxl

wb = openpyxl.load_workbook('Test.xlsx')
first_sheet = wb.sheetnames[0]
Data = wb.get_sheet_by_name(first_sheet)
second_sheet = wb.sheetnames[1]
Dictionary = wb.get_sheet_by_name(second_sheet)

for rownum in range(2,Data.max_row+1):
  var1 = Data.cell(row=rownum, column=1).value 
  for rownum1 in range(2,Dictionary.max_row+1):  
    var2 = Dictionary.cell(row=rownum1, column=1).value 
    for colnum2 in range(2,Dictionary.max_column+1):
      var3 = Dictionary.cell(row=rownum1, column=colnum2).value 
      if var1 != var2 and var1 == var3:
       Data.cell(row=rownum, column=4).value = var2
       wb.save('Test.xlsx')
      else:
         None

You can solve this with a hash-based lookup (a set or a dict), which lets you check whether a value is present in constant time on average, instead of scanning the whole 'Dictionary' sheet for every cell in 'Data'.
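To illustrate the idea without touching Excel, here is a minimal sketch that uses plain Python lists in place of the worksheets. The sample rows and the helper name `build_lookup` are made up for illustration; they are not from the original file.

```python
# Hypothetical stand-in for the 'Dictionary' sheet: the first item of each row
# is the canonical value, the remaining items are its variants.
dictionary_rows = [
    ["colour", "color", "colr"],
    ["centre", "center", None],
]


def build_lookup(rows):
    """Map every non-empty value outside the first column to the
    first-column value of its row."""
    lookup = {}
    for first, *others in rows:
        for value in others:
            if value is not None and value != first:
                lookup[value] = first
    return lookup


lookup = build_lookup(dictionary_rows)

# Each lookup is a single hash probe (constant time on average),
# instead of a scan over every cell of the dictionary sheet.
data = ["color", "colour", "center", "unknown"]
matches = [lookup.get(value) for value in data]
print(matches)  # ['colour', None, 'centre', None]
```

The mapping is built once, so the total work is roughly proportional to the size of 'Dictionary' plus the size of 'Data', rather than their product.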

Edit: You wanted a more specific example

Imports and setting up your files:

import openpyxl

wb = openpyxl.load_workbook('Test.xlsx')
# wb[name] replaces the deprecated wb.get_sheet_by_name(name)
first_sheet = wb.sheetnames[0]
Data = wb[first_sheet]
second_sheet = wb.sheetnames[1]
Dictionary = wb[second_sheet]

Read each value in Dictionary into memory, creating a dictionary data structure that maps each value outside the first column to the first-column value of its row.

Dict = {}

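for row in range(2, Dictionary.max_row + 1):
    first_value = Dictionary.cell(row=row, column=1).value
    for col in range(2, Dictionary.max_column + 1):
        # openpyxl's keyword is `column`, not `col`
        cell_value = Dictionary.cell(row=row, column=col).value
        if cell_value is not None:  # skip empty cells
            Dict[cell_value] = first_value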

Now iterate through Data and perform your operations using Dict:

for row in range(2, Data.max_row + 1):
    # start at column 1, which is where the question's code reads its data
    for col in range(1, Data.max_column + 1):
        cell_value = Data.cell(row=row, column=col).value
        if cell_value in Dict:  # if it was elsewhere in Dictionary
            # I'm not sure what you meant by "next to", so here it just overwrites
            # the value with the corresponding first-column value in Dictionary
            Data.cell(row=row, column=col).value = Dict[cell_value]

wb.save('Test.xlsx') #save once at the end
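If "next to" means a separate column, as in the question's code (which writes to column 4), the overwrite step could be replaced by something along these lines. This is only a sketch under assumptions: `plan_writes` and `apply_writes` are hypothetical helper names, the default target column 4 and start row 2 are taken from the question's code, and the lookup argument is the Dict mapping built above.

```python
def plan_writes(values, lookup, first_row=2, target_column=4):
    """Return (row, column, new_value) tuples for every value found in lookup.

    values: the cell values of Data's first column, starting at first_row.
    """
    writes = []
    for offset, value in enumerate(values):
        if value in lookup:
            writes.append((first_row + offset, target_column, lookup[value]))
    return writes


def apply_writes(ws, writes):
    """Write the planned values into an openpyxl worksheet."""
    for row, column, value in writes:
        ws.cell(row=row, column=column, value=value)


# Example with made-up data. With a real workbook, the values could be collected as
# [Data.cell(row=r, column=1).value for r in range(2, Data.max_row + 1)].
example = plan_writes(
    ["color", "colour", "center"],
    {"color": "colour", "center": "centre"},
)
print(example)  # [(2, 4, 'colour'), (4, 4, 'centre')]
```

Separating the planning from the writing keeps the matching logic testable without a workbook; `apply_writes(Data, example)` followed by a single `wb.save(...)` would then persist the result.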

It may be a bit late, but in case anyone is having the same trouble: I had the same issue, so I converted the Excel sheet into a NumPy 2D array, and the search runs much faster. This is a modification of my code for the OP's problem:

import numpy as np
import openpyxl

path = 'Test.xlsx'  # assumption: workbook path, file name taken from the question
file = openpyxl.load_workbook(path, data_only=True)
WS_Names = file['Names']  # Worksheet containing the names
NP_Names = np.array(list(WS_Names.values))  # Convert to a NumPy 2D array
WS_Dict = file['Dict']  # Worksheet containing the lookup data
NP_Dict = np.array(list(WS_Dict.values))  # Convert to a NumPy 2D array

names = NP_Names.T[0] #Take the first column containing data

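for idx, name in enumerate(names):
    # (row, col) index pairs of every cell in NP_Dict equal to this name
    locations = np.column_stack(np.where(name == NP_Dict))
    for row, col in locations:
        if col != 0:  # ignore matches in the first column
            # write the first-column value of the matching row next to the name
            WS_Names.cell(row=idx + 1, column=4).value = NP_Dict[row, 0]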

Hope it helps you :)
