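Python (3.5) - Constructing String to Save File - String Contains Escape Characters
I'm using Python (3.5) to loop through some .msg files and extract data from them, including the URL of a file to download and the folder the file should go into. I've successfully extracted the data from the .msg files, but when I try to piece together the absolute file path for the downloaded file, the result ends up in a strange format, with backslashes and \t\r in it.
Here's a short view of the code: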
for file in files:
    file_abs_path = script_dir + '/' + file
    print(file_abs_path)
    outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
    msg = outlook.OpenSharedItem(file_abs_path)
    pattern = re.compile(r'(?:^|(?<=\n))[^:<\n]*[:<]\s*([^>\n]*)', flags=re.DOTALL)
    results = pattern.findall(msg.Body)
    # results[0] -> eventID
    regexID = re.compile(r'^[^\/\s]*', flags=re.DOTALL)
    filtered = regexID.findall(results[0])
    eventID = filtered[0]
    # print(eventID)
    # results[1] -> title
    title = results[1].translate(str.maketrans('','',string.punctuation)).replace(' ', '_') #results[1]
    title = unicodedata.normalize('NFKD', title).encode('ascii', 'ignore')
    title = title.decode('UTF-8')
    #results[1]
    print(title)
    # results[2] -> account
    regexAcc = re.compile(r'^[^\(\s]*', flags=re.DOTALL)
    filtered = regexAcc.findall(results[2])
    account = filtered[0]
    account = unicodedata.normalize('NFKD', account).encode('ascii', 'ignore')
    account = account.decode('UTF-8')
    # print(account)
    # results[3] -> downloadURL
    downloadURL = results[3]
    # print(downloadURL)
    rel_path = account + '/' + eventID + '_' + title + '.mp4'
    rel_path = unicodedata.normalize('NFKD', rel_path).encode('ascii', 'ignore')
    rel_path = rel_path.decode('UTF-8')
    filename_abs_path = os.path.join(script_dir, rel_path)
    # Download .mp4 from a url and save it locally under `file_name`:
    with urllib.request.urlopen(downloadURL) as response, open(filename_abs_path, 'wb') as out_file:
        shutil.copyfileobj(response, out_file)
    # print item [ID - Title] when done
    print('[Complete] ' + eventID + ' - ' + title)
    del outlook, msg
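As you can see, I have some regexes that pull four pieces of data out of the .msg. I then have to go through each one and do some further fine-tuning, after which they look like this: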
eventID
# 123456
title
# Name_of_item_with_underscord_no_punctuation
account
# nameofaccount
downloadURL
# http://download.com/basicurlandfile.mp4
That is the data I get. I have print()-ed it, and it shows no strange characters. But when I try to construct the path (file name and directory) for the .mp4:
downloadURL = results[3]
# print(downloadURL)
rel_path = account + '/' + eventID + '_' + title + '.mp4'
rel_path = unicodedata.normalize('NFKD', rel_path).encode('ascii', 'ignore')
rel_path = rel_path.decode('UTF-8')
filename_abs_path = os.path.join(script_dir, rel_path)
# Download .mp4 from a url and save it locally under `file_name`:
with urllib.request.urlopen(downloadURL) as response, open(filename_abs_path, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)
After doing this, the output I get from running the code is:
Traceback (most recent call last):
  File "sfaScript.py", line 65, in <module>
    with urllib.request.urlopen(downloadURL) as response, open(filename_abs_path, 'wb') as out_file:
OSError: [Errno 22] Invalid argument: 'C:/Users/Kenny/Desktop/sfa_kenny_batch_1\\accountnamehere/123456_Name_of_item_with_underscord_no_punctuation\t\r.mp4'
TL;DR - Question
So filename_abs_path
somehow gets changed to C:/Users/Kenny/Desktop/sfa_kenny_batch_1\\accountnamehere/123456_Name_of_item_with_underscord_no_punctuation\t\r.mp4
I need it to be
C:/Users/Kenny/Desktop/sfa_kenny_batch_1/accountnamehere/123456_Name_of_item_with_underscord_no_punctuation.mp4
Thanks for any help you can provide!
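It looks like your regex captured a tab (\t) and a carriage return (\r) at the end of title. print() renders both as invisible whitespace, which is why the printed value looked clean, while the error message shows the repr of the path, where they appear as escape sequences.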
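To see why print() hid the problem, here is a minimal sketch. The title value is a hypothetical stand-in for what the regex captured from msg.Body; only the trailing \t\r is taken from the error message.

```python
# Hypothetical value standing in for what the regex captured from msg.Body.
title = "Name_of_item_with_underscord_no_punctuation\t\r"

print(title)        # the tab and carriage return are invisible here

shown = repr(title)
print(shown)        # repr() writes them out as the escapes \t and \r
```

Printing repr(value) rather than the value itself is a quick way to check any extracted string for hidden whitespace.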
A quick fix is:
title = title.strip()
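(before composing the file name)
This removes all leading and trailing "whitespace" characters, including tabs and carriage returns.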
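As a rough sketch of where the strip() call fits, the path construction could look like the following. The helper name build_mp4_path is hypothetical and not part of the original script, and ntpath (the Windows flavour of os.path) is used only so that the example behaves the same on any operating system. On Windows itself, os.path does the same job. The normpath call is optional: Windows accepts mixed / and \ separators, so the doubled backslash in the error message is just the repr of a single separator inserted by os.path.join, not the cause of the error.

```python
import ntpath  # Windows flavour of os.path, importable on any platform


def build_mp4_path(script_dir, account, event_id, title):
    """Join the pieces into a Windows path, dropping stray whitespace first."""
    # Hypothetical helper for illustration; not part of the original script.
    title = title.strip()  # removes leading/trailing \t, \r, \n and spaces
    rel_path = ntpath.join(account, event_id + "_" + title + ".mp4")
    # normpath rewrites every separator as a backslash (cosmetic only).
    return ntpath.normpath(ntpath.join(script_dir, rel_path))


path = build_mp4_path(
    "C:/Users/Kenny/Desktop/sfa_kenny_batch_1",
    "accountnamehere",
    "123456",
    "Name_of_item_with_underscord_no_punctuation\t\r",
)
print(path)
```

Note that strip() only trims the ends of the string. If whitespace could also appear in the middle of a captured value, it would need a different treatment, for example a replacement over the whole string.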