Python regex to capture emails while stripping surrounding dashes, or to ignore emails ending in .jpg and similar

Question

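I'm trying to figure out how to improve my regex so that it only picks up emails that do not end in ".jpg", and strips -- from the left and right of the email if present. Here is an example of the source argument, which is a string: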

<html>
   <body>
   <p>aaa@example.jpg</p>
   <p>--bbb@example.com--</p>
   <p>ccc@example.com--</p>
   <p>--ddd@example.com</p>

</body>
</html>

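The result should contain: bbb@example.com, ccc@example.com, ddd@example.com. So basically, I'd like to improve this function so that the regex returns the emails without the --. If possible, I'd also like to improve the if not email[0].endswith('.png') chain, which gets unwieldy if I want to add more extensions.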

import re

def extract_emails(source):

    regex = re.compile(r'([\w\-\.]{1,100}@(\w[\w\-]+\.)+[\w\-]+)')
    emails = list(set(regex.findall(source.decode("utf8"))))
    all_emails = []
    for email in emails:
        if not email[0].endswith('.png') and not email[0].endswith('.jpg') \
                and not email[0].endswith('.gif') and not email[0].endswith('.rar')\
                and not email[0].endswith('.zip') and not email[0].endswith('.swf'):
            all_emails.append(email[0].lower())

    return list(set(all_emails))

Answer 1

I assume there are only a few top-level domains you care about, so you can whitelist them with alternation:

import re

s="""<html>
   <body>
   <p>aaa@example.jpg</p>
   <p>--bbb@example.com--</p>
   <p>ccc@example.com--</p>
   <p>--ddd@example.com</p>

</body>
</html>"""
# The TLD alternatives are wrapped in (?:...) so that | applies only to the TLD
print(re.findall(r"-*([\w\.]{1,100}@\w[\w\-]+\.(?:com|biz|us|bd))-*", s))

['bbb@example.com', 'ccc@example.com', 'ddd@example.com']
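One thing to watch for with alternation: `|` has the lowest precedence in a regex, so the TLD alternatives need their own group. Otherwise a bare `us` or `bd` anywhere in the text counts as a match. The snippet below is only a sketch of that pitfall; the sample `text` is made up for illustration.

```python
import re

# Made-up sample text for illustration only.
text = "contact: --bbb@example.com-- or visit the bus stop"

# Ungrouped: '|' splits the whole pattern, so 'us' on its own is an alternative.
ungrouped = re.findall(r"-*([\w.]+@\w[\w\-]+\.com|biz|us|bd)-*", text)

# Grouped with (?:...): the alternation applies only to the TLD.
grouped = re.findall(r"-*([\w.]+@\w[\w\-]+\.(?:com|biz|us|bd))\b-*", text)

print(ungrouped)  # ['bbb@example.com', 'us']  ('us' comes from "bus")
print(grouped)    # ['bbb@example.com']
```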

Or try a negative lookahead that rejects the unwanted extensions, \w+@\w+\.(?!jpg|png)\w+\.*\w*

s="""<html>
   <body>
   <p>aaa@example.jpg</p>
   <p>--bbb@example.com--</p>
   <p>ccc@example.com--</p>
   <p>--ddd@example.com</p>

</body>
</html>"""
print(re.findall(r"\w+@\w+\.(?!jpg|png)\w+\.*\w*", s))

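Writing one fixed regex that validates every email address is very hard. For details on email validation, see the question about using a regular expression to validate an email address, which has 69 answers.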

Answer 2

import re

x="""<html>
   <body>
   <p>aaa@example.jpg</p>
   <p>--bbb@example.com--</p>
   <p>ccc@example.com--</p>
   <p>--ddd@example.com</p>

</body>
</html>"""
# -* outside the capture group consumes the dashes; (?!jpg) rejects a .jpg ending
print(re.findall(r"-*([\w\-\.]{1,100}@(?:\w[\w\-]+\.)+(?!jpg)[\w]+)-*", x))

Output: ['bbb@example.com', 'ccc@example.com', 'ddd@example.com']
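To cover the other extensions from the question (.png, .gif, .rar and so on), the lookahead can be built from a list instead of being written out by hand. The following is a rough sketch rather than a definitive version: the names BLOCKED_EXTENSIONS and EMAIL_RE are made up here, and the trailing `\b(?!\.\w)` guard is an extra assumption, added so that something like x@photo.example.png is not cut down to x@photo.example.

```python
import re

# Hypothetical blocklist taken from the extensions in the question; extend as needed.
BLOCKED_EXTENSIONS = ("png", "jpg", "gif", "rar", "zip", "swf")

blocked = "|".join(re.escape(ext) for ext in BLOCKED_EXTENSIONS)

EMAIL_RE = re.compile(
    r"-*"                                  # leading dashes, outside the capture
    r"([\w.\-]{1,100}@(?:\w[\w\-]+\.)+"    # local part, '@', one or more domain labels
    r"(?!(?:" + blocked + r")\b)\w+)"      # final label, unless it is a blocked extension
    r"\b(?!\.\w)"                          # assumed guard: the match must not stop mid-domain
    r"-*",                                 # trailing dashes, outside the capture
    re.IGNORECASE,
)


def extract_emails(source):
    """Return the unique, lower-cased addresses found in a text string."""
    return sorted({match.lower() for match in EMAIL_RE.findall(source)})


html = (
    "<p>aaa@example.jpg</p><p>--bbb@example.com--</p>"
    "<p>ccc@example.com--</p><p>--ddd@example.com</p>"
    "<p>x@photo.example.PNG</p>"
)
result = extract_emails(html)
print(result)  # ['bbb@example.com', 'ccc@example.com', 'ddd@example.com']
```

Adding another extension then only means adding one entry to the tuple.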

Answer 3

The best approach is to use an HTML parser such as BeautifulSoup:

In [37]: from bs4 import BeautifulSoup

In [38]: soup = BeautifulSoup('''<html>
   ....:    <body>
   ....:    <p>aaa@example.jpg</p>
   ....:    <p>--bbb@example.com--</p>
   ....:    <p>ccc@example.com--</p>
   ....:    <p>--ddd@example.com</p>
   ....:
   ....: </body>
   ....: </html>''', 'lxml')

In [39]: [email.strip('-') for email in soup.stripped_strings if not email.endswith('.jpg')]
Out[39]: ['bbb@example.com', 'ccc@example.com', 'ddd@example.com']
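The `endswith` filter extends naturally to several extensions, because `str.endswith` accepts a tuple of suffixes. That also replaces the long `and not ... endswith(...)` chain from the question. Here is a minimal sketch that works on any iterable of strings (for example `soup.stripped_strings`); the names BLOCKED_SUFFIXES and clean_emails are made up for illustration, and the `"@" in email` check is an added assumption to skip non-email text.

```python
# Hypothetical blocklist taken from the extensions in the question.
BLOCKED_SUFFIXES = (".png", ".jpg", ".gif", ".rar", ".zip", ".swf")


def clean_emails(candidates):
    """Strip surrounding dashes and drop entries ending in a blocked suffix."""
    cleaned = []
    for text in candidates:
        email = text.strip().strip("-").lower()
        # str.endswith accepts a tuple, so one call covers every suffix.
        if "@" in email and not email.endswith(BLOCKED_SUFFIXES):
            cleaned.append(email)
    return cleaned


candidates = [
    "aaa@example.jpg",
    "--bbb@example.com--",
    "ccc@example.com--",
    "--ddd@example.com",
]
cleaned = clean_emails(candidates)
print(cleaned)  # ['bbb@example.com', 'ccc@example.com', 'ddd@example.com']
```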

Answer 1: score 2, posted 2015-12-03 10:39:30
Answer 2: score 1, accepted, posted 2015-12-03 10:41:47
Answer 3: score 0, posted 2015-12-03 10:46:54
