Extracting numbers from an Outlook email body with Python

Question

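Every hour I receive an email alert telling me how much revenue the company made in the past hour. I want to extract this information into a pandas DataFrame so that I can run some analysis on it.

My problem is that I can't figure out how to extract the data from the email body in a usable format. I think I need to use regular expressions, but I'm not very familiar with them.

This is what I have so far: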

import os
import pandas as pd
import datetime as dt
import win32com.client

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)
messages = inbox.Items

#Empty Lists
email_subject = []
email_date = []
email_content = []

#find emails

for message in messages:
    if message.SenderEmailAddress == 'oracle@xyz.com' and message.Subject.startswith('Demand'):
        email_subject.append(message.Subject)
        email_date.append(message.senton.date()) 
        email_content.append(message.body)

The email_content list looks like this:

'                                                                                                                   \r\nDemand: $41,225 (-47%)\t                                                                            \r\n                                                                                                                       \r\nOrders: 515 (-53%)\t                                                                                \r\nUnits: 849 (-59%)\t                                                                                 \r\n                                                                                                                       \r\nAOV: $80 (12%)                                                                                                          \r\nAUR: $49 (30%)                                                                                                          \r\n                                                                                                                       \r\nOrders with Promo Code: 3%                                                                                              \r\nAverage Discount: 21%                                                                                             '

Can anyone tell me how to split its contents so that I can get the int values of Demand, Orders and Units in separate columns?

Thanks!

Answer 1

You can use a combination of string.split() and string.strip() to first extract each line individually.

# email_content is a list of bodies; take one body (here the first) as a string
string = email_content[0]
lines = string.split('\r\n')
lines_stripped = []
for line in lines:
    line = line.strip()
    if line != '':
        lines_stripped.append(line)

This gives you a list like this:

['Demand: $41,225 (-47%)', 'Orders: 515 (-53%)', 'Units: 849 (-59%)', 'AOV: $80 (12%)', 'AUR: $49 (30%)', 'Orders with Promo Code: 3%', 'Average Discount: 21%']

You can also achieve the same result in a more compact (Pythonic) way:

lines_stripped = [line.strip() for line in string.split('\r\n') if line.strip() != '']
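As a possible variation, str.splitlines() avoids hard-coding '\r\n', which may help if some messages arrive with plain '\n' line endings. This is only a sketch: the sample body below is shortened from the one in the question, with the whitespace padding abbreviated.

```python
# Sample body shortened from the question; the padding is abbreviated.
body = '   \r\nDemand: $41,225 (-47%)\t   \r\n   \r\nOrders: 515 (-53%)\t  \r\nUnits: 849 (-59%)\t   '

# splitlines() breaks on \r\n, \n and \r alike; strip() drops tabs and padding,
# and the "if line.strip()" filter discards lines that are only whitespace.
lines_stripped = [line.strip() for line in body.splitlines() if line.strip()]
print(lines_stripped)
```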

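Once you have this list, you can use a regular expression to extract the values. I recommend https://regexr.com/ for experimenting with your regex.

After some quick experimentation, r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?' should work.

Here is the code that builds a dictionary from the lines_stripped list we created above: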

import re
regex = r'([\S\s]*):\s*(\S*)\s*\(?(\S*)\)?'
matched_dict = {}
for line in lines_stripped:
    match = re.match(regex, line)
    matched_dict[match.groups()[0]] = (match.groups()[1], match.groups()[2])

print(matched_dict)

這會產生以下輸出：

{'AOV': ('$80', '12%)'),
 'AUR': ('$49', '30%)'),
 'Average Discount': ('21%', ''),
 'Demand': ('$41,225', '-47%)'),
 'Orders': ('515', '-53%)'),
 'Orders with Promo Code': ('3%', ''),
 'Units': ('849', '-59%)')}

You asked for Units, Orders and Demand, so here is the extraction:

# Remove the dollar sign before converting to int
# Replace the thousands separator ',' with an empty string
demand_string = matched_dict['Demand'][0].strip('$').replace(',', '')
print(int(demand_string))
print(int(matched_dict['Orders'][0]))
print(int(matched_dict['Units'][0]))

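As you can see, Demand is slightly more complicated because it contains extra characters ('$' and ',') that Python cannot parse when converting to int.

Here is the final output of these 3 prints: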

41225
515
849
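Since the goal in the question is one column each for Demand, Orders and Units, the same idea can be applied to every body in email_content. The following is a rough sketch rather than a definitive implementation: parse_body is a hypothetical helper name, it searches each body directly instead of reusing matched_dict, and the second sample body is made up purely for illustration.

```python
import re

def parse_body(body):
    """Return Demand, Orders and Units from one email body as ints."""
    row = {}
    for key in ('Demand', 'Orders', 'Units'):
        # Look for e.g. "Demand: $41,225" at the start of a line; the "$" is optional.
        m = re.search(rf'^\s*{key}:\s*\$?([\d,]+)', body, flags=re.MULTILINE)
        if m:
            # Drop the thousands separator before converting to int.
            row[key] = int(m.group(1).replace(',', ''))
    return row

# First body is abbreviated from the question; the second is invented sample data.
bodies = [
    ' \r\nDemand: $41,225 (-47%)\t \r\n \r\nOrders: 515 (-53%)\t \r\nUnits: 849 (-59%)\t \r\nOrders with Promo Code: 3% ',
    ' \r\nDemand: $1,000 (5%)\t \r\nOrders: 20 (1%)\t \r\nUnits: 30 (2%)\t ',
]
rows = [parse_body(b) for b in bodies]
print(rows)
```

Passing rows to pd.DataFrame(rows) should then give one column per key. The sent dates collected in email_date could be added as a further column if needed.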

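I hope this answers your question! If you have more questions about regex, I encourage you to play around with regexr; it is very well built.

Edit: there appears to be a small flaw in the regex that causes the final ')' to be captured in the last group. This doesn't affect your question, though!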
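For anyone who does need the percentage change, one possible way to keep the closing ')' out of the last group is to make the whole parenthesised part an optional non-capturing group. This pattern is my own sketch, not the one from the answer above, and it assumes every line has the shape "Label: value" optionally followed by "(change)".

```python
import re

# label : value, optionally followed by "(change)"; the parentheses themselves
# sit outside the third capture group, so they are not included in it.
pattern = re.compile(r'^([^:]+):\s*(\S+)(?:\s*\(([^)]*)\))?$')

lines = ['Demand: $41,225 (-47%)', 'Orders: 515 (-53%)', 'Average Discount: 21%']
parsed = {}
for line in lines:
    m = pattern.match(line)
    if m is None:
        continue  # skip lines that do not follow the expected shape
    label, value, change = m.groups()
    # change is None when the line has no "(...)" part
    parsed[label] = (value, change or '')
print(parsed)
```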

Answer 1 (score: 2, accepted, 2018-05-31 14:15:54)