Python Script takes very long time to run

Question

I've managed to write a piece of code (pieced together from several sources on the web and adapted to my needs) which should do the following:

Read an Excel file.
For each cell in column A, search for its value in the subjects of the mails in a specific folder.
If it matches (the cell value equals the first 9 characters of the subject), save the attachment (each mail has exactly one attachment) with the value of the cell in an "output" folder.
If it doesn't match, go on to the next mail, and then to the next cell value.
At the end, display the run time (not very important, only for my own information).

The code actually works (tested with an email folder with only 9 emails). My problem is the run time.

The actual scope of the script is to look for 2539 values in a folder with 32700 emails and save the attachments.

I've done 2 runs, as follows:

2539 values in 32700 emails (stopped after ~1 hour)
10 values in 32700 emails (stopped after ~40 minutes; in this time the script processed 4 values)

I would like to know whether there is a way to make the script faster, or whether it's slow because it's badly written, etc.

Below is my code:

from pathlib import Path
import time

import openpyxl
import win32com.client

# Start the timer before any work is done
start = time.time()

# Name of the folder created for output
output_dir = Path.cwd() / "Orders"

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

folder = outlook.Folders.Item("Shared Mailbox Name")
inbox = folder.Folders.Item("Inbox")

messages = inbox.Items

wb = openpyxl.load_workbook(r"C:\Users\TEST\Path-to-excel\FolderName\ExcelName.xlsx")
sheet = wb['Sheet1']
names = sheet['A']

# For every value in column A, scan every mail in the folder
for cellObj in names:
    ordno = str(cellObj.value)
    print(ordno)

    for message in messages:
        subject = message.Subject
        body = message.body
        attachments = message.Attachments

        if str(subject)[:9] == ordno:
            output_dir.mkdir(parents=True, exist_ok=True)
            for attachment in attachments:
                attachment.SaveAsFile(output_dir / str(attachment))
        else:
            pass

print(f'Time taken to run: {time.time() - start} seconds')

I should mention that I am a complete rookie in Python, so any help from the community is welcome, especially if it comes with some clarification of what I did wrong and why.

I've also read some similar questions, but nothing helped, or at least I don't know how to adapt the methods.

Thank you!

Answer 1

Your use case is too specific for anyone to be able to recreate, so any hints about performance can only be generic. Your main problem, though, is a combination of O(M × N) work (every value is checked against every message) and synchronous processing: you currently process one value and one message at a time, and each step includes disk I/O to get the e-mail.
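To put rough numbers on that, here is a back-of-the-envelope calculation using the figures from the question (2539 values, 32700 e-mails). The function names are made up for illustration, and the count covers only the reads of Subject, not the other properties fetched per message.

```python
def nested_loop_reads(num_values: int, num_messages: int) -> int:
    """Subject reads when every value is compared against every message."""
    return num_values * num_messages


def single_pass_reads(num_messages: int) -> int:
    """Subject reads when each message is visited exactly once."""
    return num_messages


values, mails = 2539, 32700
print(nested_loop_reads(values, mails))  # 83025300
print(single_pass_reads(mails))          # 32700
```

Each of those reads goes through Outlook, so cutting the count from about 83 million to 32700 is likely to matter far more than any tuning inside the loop.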

You can certainly improve things by creating a single list of values from the workbook. You can then use this list with a processing pool (see the Python documentation) to read multiple e-mails at once.

But things might be even better if you can use the subject to query the mail server.

If you have follow-up questions, please break them down to specific parts of the task.

Answer 2

First of all, avoid iterating over all items in the folder:

for message in messages:
    subject = message.Subject

and then checking whether each subject starts with the specified string:

if str(subject)[:9] == ordno:

Instead, use the Find / FindNext or Restrict methods of the Items class, which return only the items that correspond to your search criteria. Read more about these methods in the following articles:

How To: Use Find and FindNext methods to retrieve Outlook mail items from a folder (C#, VB.NET)
How To: Use Restrict method to retrieve Outlook mail items from a folder

For example, you could use the following restriction on the collection (taken from the VBA sample):

criteria = "@SQL=" & Chr(34) & "urn:schemas:httpmail:subject" & Chr(34) & " ci_phrasematch 'question'"

See Filtering Items Using a String Comparison for more information.
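In Python the same idea might look roughly like the sketch below. It assumes a prefix match on the subject using the DASL `like` operator with a `%` wildcard, and it assumes the order numbers contain no quotes or wildcard characters. The helper names (`subject_prefix_filter`, `iter_found`, `restrict_matching`) are invented for this example; only `Find`, `FindNext` and `Restrict` are Outlook methods.

```python
def subject_prefix_filter(prefix: str) -> str:
    """Build a DASL filter for subjects that start with `prefix`.

    Assumption: `prefix` is a plain order number. Values containing
    quotes or '%' are rejected rather than escaped.
    """
    if any(ch in prefix for ch in ("'", '"', "%")):
        raise ValueError(f"unsupported character in prefix: {prefix!r}")
    return f'@SQL="urn:schemas:httpmail:subject" like \'{prefix}%\''


def iter_found(items, prefix: str):
    """Yield matching items one by one using Find / FindNext."""
    item = items.Find(subject_prefix_filter(prefix))
    while item is not None:  # Find / FindNext return None when exhausted
        yield item
        item = items.FindNext()


def restrict_matching(items, prefix: str):
    """Return a filtered Items collection using Restrict."""
    return items.Restrict(subject_prefix_filter(prefix))
```

With the question's variables, this would be used as `for message in iter_found(inbox.Items, ordno): ...`, so that Outlook does the filtering and Python only sees the mails that match.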

Also you may find the AdvancedSearch method of the Application class helpful. The key benefits of using the AdvancedSearch method in Outlook are:

The search is performed in another thread. You don't need to run another thread manually since the AdvancedSearch method runs it automatically in the background.
You can search for any item type (mail, appointment, calendar, notes, etc.) in any location, i.e. beyond the scope of a single folder. The Restrict and Find / FindNext methods, by contrast, can only be applied to a particular Items collection (see the Items property of the Folder class in Outlook).
Full support for DASL queries (custom properties can be used for searching too). To improve the search performance, Instant Search keywords can be used if Instant Search is enabled for the store (see the IsInstantSearchEnabled property of the Store class).
You can stop the search process at any moment using the Stop method of the Search class.

See Advanced search in Outlook programmatically: C#, VB.NET for more information on that.

Answer 3

It seems to me the main problem with your program is that you have two nested loops (one over the values and one over the mails) when you only need to loop over the mails and check whether each subject is in the list of values.

First you need to construct your list of values with something like:

ordno_values = [str(cellObj.value) for cellObj in names]

Then, in your loop over the mails, you just need to change the condition to:

if str(subject)[:9] in ordno_values:
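Put together, a single-pass version could look something like this sketch. It swaps the list for a set so that each membership test takes roughly constant time, and it assumes the attachments expose `FileName` and `SaveAsFile` as Outlook attachments do. The function name `save_matching_attachments` is made up for this example.

```python
from pathlib import Path


def save_matching_attachments(messages, order_numbers, output_dir):
    """Visit each message once and save attachments of matching mails.

    `messages` is any iterable of objects with `Subject` and
    `Attachments`, e.g. `inbox.Items` from the question.
    """
    wanted = set(order_numbers)  # set lookup instead of scanning a list
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    saved = []
    for message in messages:
        # Same rule as the question: first 9 characters of the subject
        if str(message.Subject)[:9] not in wanted:
            continue
        for attachment in message.Attachments:
            target = output_dir / attachment.FileName
            attachment.SaveAsFile(str(target))  # pass a plain string path
            saved.append(target)
    return saved
```

With the question's variables, the call would be `save_matching_attachments(messages, ordno_values, output_dir)`.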

3 answers

solution1
0 2022-06-14 15:15:13

solution2
0 2022-06-14 15:18:06

solution3
0 ACCEPTED 2022-06-15 06:42:47
