How can I improve a regular expression to extract phone numbers?

Question

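I am trying to extract phone numbers from web pages with a regular expression. The problem is that it also picks up unwanted ids and other page elements. Any suggested improvements would be very helpful. Below are the code and the regular expression I am using in Python: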

import re
from urllib2 import urlopen as uReq

uClient = uReq(url)
page_html = uClient.read()
print re.findall(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", page_html)

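Right now, for most sites the script returns some page element values, and it is only sometimes accurate. Please suggest modifications to the expression: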

re.findall(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?",page_html)

My output for different URLs looks like this:

http://www.fraitagengineering.com/index.html
['(877) 424-4752']
http://hunterhawk.com/
['1481240672', '1481240643', '1479852632', '1478013441', '1481054486', '1481054560', '1481054598', '1481054588', '1476820246', '1481054521', '1481054540', '1476819829', '1481240830', '1479855986', '1479855990', '1479855994', '1479855895', '1476819760', '1476741750', '1476741750', '1476820517', '1479862863', '1476982247', '1481058326', '1481240672', '1481240830', '1513106590', '1481240643', '1479855986', '1479855990', '1479855994', '1479855895', '1479852632', '1478013441', '1715282331', '1041873852', '1736722557', '1525761106', '1481054486', '1476819760', '1481054560', '1476741750', '1481054598', '1476741750', '1481054588', '1476820246', '1481054521', '1476820517', '1479862863', '1481054540', '1476982247', '1476819829', '1481058326', '(925) 798-4950', '2093796260']
http://www.lbjewelrydesign.com/
['213-629-1823', '213-629-1823']

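The phone number formats I want are (000) 000-0000 (note that I have added a space after the parenthesis), (000)-000-0000, or 000-000-0000. Any suggestions are appreciated. Note that I have already consulted this link: Find phone numbers in python script

I need to improve the regular expression for my specific needs.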

Answer 1

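If you search only the plain text of the page, you avoid matching inside ids, other attributes, or the HTML tags themselves. You can do this by running the page content through the BeautifulSoup HTML parser: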

from urllib2 import urlopen as uReq

from bs4 import BeautifulSoup

page_text = BeautifulSoup(uReq(url), "html.parser").get_text()

Then, as Jack mentioned in the comments, you can make your regular expression more reliable:

Find phone numbers in python script
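To illustrate why searching only the text helps, here is a minimal Python 3 sketch. It is not the BeautifulSoup approach from this answer: it swaps in the standard library's `html.parser` so that it runs without extra packages. The class `TextOnly`, the function `visible_text`, and the sample HTML are all hypothetical. As a further assumption, the sketch skips `script` and `style` bodies, which `get_text()` does not necessarily do.

```python
import re
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes; ignore tags, attributes, and script/style bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def visible_text(html):
    """Return the text content of an HTML string (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up page: a numeric id, a phone number in text, a number inside a script.
html = ('<div id="1481240672"><p>Call us: (925) 798-4950</p>'
        '<script>var t = 1479855986;</script></div>')
pattern = r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?"  # the question's pattern

raw_hits = re.findall(pattern, html)                 # searches the markup too
text_hits = re.findall(pattern, visible_text(html))  # searches the text only
print(raw_hits)
print(text_hits)
```

On the raw markup the question's pattern also picks up the id and the number inside the script; on the extracted text only the number in the paragraph remains.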

Answer 2

The following regular expression matches the samples you provided and other similar numbers:

(\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}
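As a reading aid, the same expression can be written with `re.VERBOSE` so that each part carries a comment. This is only a sketch for Python 3; the names `PHONE_VERBOSE` and `compact` are hypothetical and not part of the original answer.

```python
import re

# The answer's pattern, spread out with re.VERBOSE so each piece is annotated.
PHONE_VERBOSE = re.compile(r"""
    (                    # area code, one of two forms:
        \([0-9]{3}\)     #   three digits in parentheses
        [\s-]?           #   then at most one whitespace character or hyphen
      |
        [0-9]{3}-        #   or three bare digits and a mandatory hyphen
    )
    [0-9]{3}             # next three digits
    -                    # mandatory hyphen
    [0-9]{4}             # last four digits
""", re.VERBOSE)

# The compact form, kept for comparison.
compact = re.compile(r"(\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}")

for sample in ['(000) 000-0000', '000-000-0000', '000 000-0000']:
    print(sample, bool(PHONE_VERBOSE.match(sample)), bool(compact.match(sample)))
```

Both forms should accept and reject the same strings. The annotated layout makes it easier to see that a bare area code must be followed by a hyphen, which is why `000 000-0000` is rejected.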

The following sample script tests positive and negative cases against the regular expression:

import re

# Strings the pattern is expected to accept.
positiveExamples = [
    '(000) 000-0000',
    '(000)-000-0000',
    '(000)000-0000',
    '000-000-0000'
]
# Strings the pattern is expected to reject.
negativeExamples = [
    '000 000-0000',
    '000-000 0000',
    '000 000 0000',
    '000000-0000',
    '000-0000000',
    '0000000000'
]

reObj = re.compile(r"(\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}")

# match() anchors only at the start of the string.
for example in positiveExamples:
    print('Asserting positive example: %s' % example)
    assert reObj.match(example)

for example in negativeExamples:
    print('Asserting negative example: %s' % example)
    assert reObj.match(example) is None
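One practical point when moving from `match` on single strings to `re.findall` on a whole page: if the pattern contains a capturing group, `findall` returns only that group (here, the area code part), not the full number. A possible adaptation for Python 3, offered as a sketch rather than as part of the original answer, uses a non-capturing group and adds digit lookarounds so that a match cannot sit inside a longer run of digits. `PHONE_RE`, `find_phone_numbers`, and the sample string are hypothetical.

```python
import re

# (?:...) is non-capturing, so findall returns the whole match.
# (?<!\d) and (?!\d) reject candidates embedded in longer digit runs.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}(?!\d)"
)


def find_phone_numbers(text):
    """Return every phone-like substring of text, in order of appearance."""
    return PHONE_RE.findall(text)


# Made-up text mixing a numeric id with the three formats from the question.
sample = 'id="1481240672" Call (925) 798-4950 or 213-629-1823. Fax: (000)-000-0000'
print(find_phone_numbers(sample))
# ['(925) 798-4950', '213-629-1823', '(000)-000-0000']
```

The ten-digit id is skipped because the pattern requires the hyphen before the last four digits.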

2 solutions

Solution 1
1 2017-12-12 21:46:37

Solution 2
1 accepted 2017-12-12 22:10:10