Email extraction starts and ends with unwanted characters (python)
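So I have a program that extracts emails and phone numbers. I ran it, and the phone numbers come out fine. However, the emails keep coming out like this, for example: 3465Usjohnson@astate.eduUProvost instead of sjohnson@astate.edu. The surrounding text they are being extracted from: 870-972-3465Usjohnson@astate.eduUProvost and Vice ChancellorDr. Lynita Cooksey870-972-2 030 870-972-2036Ulcooksey@astate.edu
In the actual PDF there is white space between the fields, but when I copy and paste there are no spaces between them, hence the emails I am getting.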
#! python3
import re, pyperclip

# Regex for phone numbers
phoneRegex = re.compile(r'''
# 860-555-3951, 555-3951, (860) 555-3951, 555-3951 ext 12345, ext. 12345, x12345
(
((\d\d\d)|(\(\d\d\d\)))?   # area code (optional)
(\s|-)                     # first separator
\d\d\d                     # first 3 digits
-                          # second separator
\d\d\d\d                   # last 4 digits
(((ext(\.)?\s)|x)          # extension word (optional)
(\d{2,5}))?                # extension number (optional)
)
''', re.VERBOSE)

# Regex for emails
emailRegex = re.compile(r'''
# some._+thing@(/d{2,5}))?.com
[a-zA-Z0-9_.+]+   # name part
@                 # @ symbol
[a-zA-Z0-9_.+]+   # domain
''', re.VERBOSE)

# Get the text off the clipboard
text = pyperclip.paste()

# Extract
extractedPhone = phoneRegex.findall(text)
extractedEmail = emailRegex.findall(text)

allPhoneNumbers = []
for phoneNumber in extractedPhone:
    # findall returns one tuple per match; the first item is the whole number
    allPhoneNumbers.append(phoneNumber[0])

# Copy to clipboard, one result per line
results = '\n'.join(allPhoneNumbers + extractedEmail)
pyperclip.copy(results)
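For reference, the overmatch can be reproduced with the email pattern above on the sample text from the question. This is only a minimal sketch; the names `email_regex` and `sample` are made up for illustration. Digits and the letter `U` both belong to the character class `[a-zA-Z0-9_.+]`, so with no whitespace between the fields the match keeps going until it hits a character outside the class (here `-` on the left and a space on the right).

```python
import re

# Same character classes as the emailRegex above, written on one line.
email_regex = re.compile(r'[a-zA-Z0-9_.+]+@[a-zA-Z0-9_.+]+')

# Sample text taken from the question (no spaces around the address).
sample = '870-972-3465Usjohnson@astate.eduUProvost and Vice Chancellor'

matches = email_regex.findall(sample)
print(matches)  # ['3465Usjohnson@astate.eduUProvost']
```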
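Since I don't have the original text, I'll use the string from your example.
See whether either of the following two regexes works for you. I've also included a third, more precise one.
r'(?<=\dU)[\w]+@[\w\.]+?(?=U|\s|$)'
r'(?<=\dU)[\w]+@[\w]+\.[\w]+?(?=U|\s|$)'
Example test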
>>> import re
>>> string = '''3465Usjohnson@astate.eduUProvost instead of sjohnson@astate.edu The surround text that it is being extracted from: 870-972-3465Usjohnson@astate.eduUProvost and Vice ChancellorDr. Lynita Cooksey870-972-2 030 870-972-2036Ulcooksey@astate.edu'''
>>> re.findall(r'(?<=\dU)[\w]+@[\w\.]+?(?=U|\s|$)', string)
#Output
['sjohnson@astate.edu', 'sjohnson@astate.edu', 'lcooksey@astate.edu']
>>> re.findall(r'(?<=\dU)[\w]+@[\w]+\.[\w]+?(?=U|\s|$)', string)
#Output
['sjohnson@astate.edu', 'sjohnson@astate.edu', 'lcooksey@astate.edu']
A bit more precise, since the emails all end in .edu:
r'(?<=\dU)[\w]+@[\w]*\.edu(?=U|\s|$)'
Example test
>>> string = '''3465Usjohnson@astate.eduUProvost instead of sjohnson@astate.edu The surround text that it is being extracted from: 870-972-3465Usjohnson@astate.eduUProvost and Vice ChancellorDr. Lynita Cooksey870-972-2 030 870-972-2036Ulcooksey@astate.edu'''
>>> re.findall(r'(?<=\dU)[\w]+@[\w]*\.edu(?=U|\s|$)', string)
#Output
['sjohnson@astate.edu', 'sjohnson@astate.edu', 'lcooksey@astate.edu']
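If it helps, here is one way the lookaround idea could be dropped into the original script in place of `emailRegex`. It is only a sketch under the same assumption as the patterns above: every address in the pasted text is preceded by a digit and a stray `U`, and followed by `U`, whitespace, or the end of the text. The names `email_regex`, `text`, and `emails` are placeholders, and `text` stands in for `pyperclip.paste()`.

```python
import re

# Assumption: each address follows "<digit>U", as in the sample text.
email_regex = re.compile(r'''
    (?<=\dU)       # lookbehind: a digit and the stray "U", not part of the match
    \w+            # name part
    @              # @ symbol
    \w+\.\w+?      # domain, last part matched lazily
    (?=U|\s|$)     # lookahead: next "U", whitespace, or end of text
''', re.VERBOSE)

# Stand-in for pyperclip.paste(); sample text from the question.
text = ('870-972-3465Usjohnson@astate.eduUProvost and Vice ChancellorDr. '
        'Lynita Cooksey870-972-2 030 870-972-2036Ulcooksey@astate.edu')

emails = email_regex.findall(text)
print(emails)  # ['sjohnson@astate.edu', 'lcooksey@astate.edu']
```

Because the lookbehind and lookahead consume nothing, the surrounding digits and `U` characters stay out of the result. Text with a different layout would need a different anchor.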
I'm new to Python myself. If the text is extracted specifically from the 'astate.edu' website, I think you could use this regex:
import re
text = '70-972-3465Usjohnson@astate.eduUProvost and Vice ChancellorDr. Lynita Cooksey870-972-2 030 870-972-2036Ulcooksey@astate.edu'
email = re.findall(r'[a-z]+@\w+\.edu', text)
# output
['sjohnson@astate.edu', 'lcooksey@astate.edu']
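As far as I can tell, this works on the sample because `[a-z]+` only accepts lowercase letters, so a match cannot start on a digit or on the uppercase `U`. Here is a quick check of that assumption. The second address is made up (it is not from the question) and shows where the pattern stops holding: a digit in the name part cuts the match short.

```python
import re

pattern = re.compile(r'[a-z]+@\w+\.edu')

# Sample from the question: the digits and the stray "U" are excluded.
sample = '870-972-2036Ulcooksey@astate.eduUProvost'
print(pattern.findall(sample))   # ['lcooksey@astate.edu']

# Hypothetical address with a digit in the name part:
# only the lowercase run right before "@" is kept.
made_up = '870-972-2036Ujdoe2x@astate.edu'
print(pattern.findall(made_up))  # ['x@astate.edu']
```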
Good luck!