
How to search and edit Excel files fast in Openpyxl

I have 2 worksheets. I need to compare each cell in the sheet 'Data' (350k rows, strings) with the cells in another sheet, 'Dictionary'. If the string is not in 'Dictionary', or is in the 1st column of 'Dictionary', do nothing. If it is present elsewhere in 'Dictionary', take the value in the first column of that row, then go to 'Data' and write it next to the cell where the string originally appeared.

As mentioned in the title, the problem is speed. This code works for a test file of around 150 rows, but takes 4 minutes to do so, so it is infeasible to use it for my file. Please tell me how I can speed it up. This is my first Python code.

import openpyxl

wb = openpyxl.load_workbook('Test.xlsx')
first_sheet = wb.sheetnames[0]
Data = wb.get_sheet_by_name(first_sheet)
second_sheet = wb.sheetnames[1]
Dictionary = wb.get_sheet_by_name(second_sheet)

for rownum in range(2,Data.max_row+1):
  var1 = Data.cell(row=rownum, column=1).value 
  for rownum1 in range(2,Dictionary.max_row+1):  
    var2 = Dictionary.cell(row=rownum1, column=1).value 
    for colnum2 in range(2,Dictionary.max_column+1):
      var3 = Dictionary.cell(row=rownum1, column=colnum2).value 
      if var1 != var2 and var1 == var3:
       Data.cell(row=rownum, column=4).value = var2
       wb.save('Test.xlsx')
      else:
         None

You can solve this by using a hash set, which lets you check whether a value is present in constant time.
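
To illustrate the idea in isolation (a toy sketch with made-up values, not the workbook code): a Python `set` is a hash set, so `in` on it takes roughly constant time on average, whereas `in` on a list scans every element.

```python
# Hypothetical values standing in for the cells of the two sheets.
dictionary_values = {"color", "colr", "center"}  # a set, i.e. a hash set
data_values = ["color", "kiwi", "center"]

# Each membership test is a hash lookup rather than a scan over the set.
found = [value for value in data_values if value in dictionary_values]
print(found)
```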

Edit: You wanted a more specific example.

Imports and setting up your files:

import openpyxl

wb = openpyxl.load_workbook('Test.xlsx')
first_sheet = wb.sheetnames[0]
Data = wb[first_sheet]  # wb[name] replaces the deprecated get_sheet_by_name
second_sheet = wb.sheetnames[1]
Dictionary = wb[second_sheet]

Read each value in Dictionary into memory, creating a dictionary data structure that maps each value in Dictionary that isn't in the first column to the value of the first column in that particular row.

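Dict = {}

for row in range(2, Dictionary.max_row + 1):
    first_value = Dictionary.cell(row=row, column=1).value
    for col in range(2, Dictionary.max_column + 1):
        # cell() takes the keyword "column", not "col"
        cell_value = Dictionary.cell(row=row, column=col).value
        if cell_value is not None:  # skip empty cells
            Dict[cell_value] = first_value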
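
As a rough sketch of the lookup-building step on its own, the function below works on plain row tuples (the shape that openpyxl's `ws.iter_rows(values_only=True)` yields) rather than on a worksheet, so it can be tried without a workbook. The name `build_lookup` and the sample rows are hypothetical. Keeping the first match when a value appears in several rows is an assumption, since the question doesn't say what should happen in that case.

```python
def build_lookup(rows):
    """Map every value outside the first column to its row's first-column value."""
    lookup = {}
    for first, *rest in rows:
        for value in rest:
            # Skip empty cells and values identical to their own first column.
            if value is None or value == first:
                continue
            # Assumption: the first row that mentions a value wins.
            lookup.setdefault(value, first)
    return lookup


# Hypothetical Dictionary rows: first-column value, then its variants.
sample_rows = [
    ("colour", "color", "colr"),
    ("centre", "center", None),
]
lookup = build_lookup(sample_rows)
print(lookup)
```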

Now iterate through Data and perform your operations using Dict:

for row in range(2, Data.max_row + 1):
    # Start at column 1, which is where the question's code reads its strings
    for col in range(1, Data.max_column + 1):
        cell_value = Data.cell(row=row, column=col).value
        if cell_value in Dict:  # if it was elsewhere in Dictionary
            # I'm not sure what you meant by "next to", so here it just overwrites
            # the value with the corresponding 1st-column value from Dictionary
            Data.cell(row=row, column=col).value = Dict[cell_value]

wb.save('Test.xlsx')  # save once at the end
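
If "next to" means a separate column, as in the question's code (which writes to column 4), one possible variant collects the writes first and leaves the original cells alone. This is only a sketch on plain Python values: `plan_writes`, the sample lookup, and the sample data are made up for illustration, and row numbers start at 2 on the assumption that row 1 is a header.

```python
def plan_writes(data_values, lookup, first_row=2):
    """Return (row_number, replacement) pairs for values found in the lookup."""
    writes = []
    for offset, value in enumerate(data_values):
        replacement = lookup.get(value)
        if replacement is not None:
            writes.append((first_row + offset, replacement))
    return writes


# Hypothetical lookup (variant -> first-column value) and Data column 1 values.
lookup = {"color": "colour", "center": "centre"}
data_values = ["color", "colour", "kiwi", "center"]

writes = plan_writes(data_values, lookup)
print(writes)
# Each pair could then be written with something like:
#     Data.cell(row=row_number, column=4).value = replacement
```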

Maybe it's a bit late, but in case anyone is having the same trouble: I had the same issue, so I converted the Excel sheet into a NumPy 2D array, and the search goes way faster. This is a modification of my code for the OP's problem:

import numpy as np
import openpyxl

file = openpyxl.load_workbook(path, data_only=True)  # path: location of your workbook
WS_Names = file['Names']  # Worksheet containing the names
NP_Names = np.array(list(WS_Names.values))  # Transformation to numpy 2D array
WS_Dict = file['Dict']  # Worksheet containing the data
NP_Dict = np.array(list(WS_Dict.values))  # Transformation to numpy 2D array

names = NP_Names.T[0]  # Take the first column containing data

for idx, name in enumerate(names):
    locations = np.column_stack(np.where(name == NP_Dict))
    for row, col in locations:
        if col != 0:  # Ignore matches in the first column
            # Write the first-column value of the matching row next to the name
            WS_Names.cell(row=idx + 1, column=4).value = NP_Dict[row, 0]

# Save once at the end; with data_only=True, formulas are saved as cached values
file.save(path)

Hope it helps you :)
