Extract phone number from html in Python

The phone number is hidden (shown as '555 143 ....') until the user clicks it, which reveals '555 1437662'. The full number is already present in the onclick attribute, though. What options can I use to get the phone number from the HTML below?

<html>
    <body>
        <h3 id="resultTelBar">
            <span onclick="showFullNumber(this, '555 1437662');
                dcsMultiTrack('DCSext._mainreq','','DCSext.linktype',
                'telephone show','DCSext.linkplace','','DCSext.linkvalue','555 1437662',
                'DCSext.show_listingId','SA_6597739_4638_003722_8396251_IYMX',
                'DCSext.show_zoningUsed','0','DCSext.show_resultNumber','1')"
                >086 143 ....</span>
        </h3>
    </body>
</html>

I noticed the beautifulsoup tag, but I suggest a variant using lxml instead. You can use it if you like. I didn't put much care into the regular expression, so you can improve it if it doesn't work in some cases.

>>> import re
>>> from lxml import etree
>>> html = etree.fromstring(u'''YOUR HTML''')
>>> onclick = html.xpath('//h3[@id="resultTelBar"]/span/@onclick')[0]
>>> print(re.search(r"showFullNumber\(this,\s*'([\d ]+)'", onclick).group(1))
555 1437662
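
Since the regular expression above may need improving, here is one possible way to loosen it, offered as a sketch rather than a definitive pattern. It assumes the number is always the second argument of showFullNumber(...), and it additionally accepts double quotes, extra whitespace, and the characters +, -, ( and ). The names FULL_NUMBER_RE and extract_full_number are hypothetical and not part of the original answer.

```python
import re

# Hypothetical helper, not part of the original answer.
# Assumes the number is the second argument of showFullNumber(...)
# and consists of digits, whitespace, and optionally +, -, ( and ).
FULL_NUMBER_RE = re.compile(
    r"""showFullNumber\(\s*this\s*,\s*   # function name and first argument
        (['"])                           # opening quote, single or double
        ([\d\s()+-]+?)                   # the phone number itself
        \1                               # matching closing quote
    """,
    re.VERBOSE,
)


def extract_full_number(onclick):
    """Return the phone number found in an onclick string, or None."""
    match = FULL_NUMBER_RE.search(onclick)
    return match.group(2).strip() if match else None


print(extract_full_number("showFullNumber(this, '555 1437662'); dcsMultiTrack('x')"))
```

Returning None instead of raising on a failed match makes it easier to skip listings whose markup differs from the sample.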

The information is embedded in a script that's included as a string in a tag attribute? That's... very unfortunate.

(Edit: To clarify, I'm assuming the question here is "given this unfortunate html/javascript as input, how can I parse out the phone number with BeautifulSoup?" Please advise if this is incorrect.)

I suppose the easiest thing is to isolate that javascript string and then use a regex to extract the number. However, the regex part will be a PITA and fairly fragile.

soup.find('h3', id='resultTelBar').span['onclick'] will get you the string, assuming soup is the BeautifulSoup object. Then use re.search to parse the number out of the first line. Which exact regex you use depends on how regular the results are (is every javascript string formatted that way, including line breaks? etc.) and on how robust you need it to be, for instance against foreign telephone numbers, or in case the javascript in future versions of this data is tweaked slightly.
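
The same two steps (isolate the onclick string, then apply a regex) can also be sketched without third-party packages. The example below swaps in the standard library's html.parser in place of BeautifulSoup and reuses the regex from the lxml answer. The names OnclickFinder and phone_from_html are hypothetical, and the sample markup is a shortened version of the HTML in the question.

```python
import re
from html.parser import HTMLParser


class OnclickFinder(HTMLParser):
    """Collect onclick values of <span> tags inside <h3 id="resultTelBar">."""

    def __init__(self):
        super().__init__()
        self.in_tel_bar = False   # True while inside the target <h3>
        self.onclicks = []        # onclick strings found so far

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h3" and attrs.get("id") == "resultTelBar":
            self.in_tel_bar = True
        elif tag == "span" and self.in_tel_bar and attrs.get("onclick"):
            self.onclicks.append(attrs["onclick"])

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_tel_bar = False


def phone_from_html(markup):
    """Return the first number passed to showFullNumber(...), or None."""
    finder = OnclickFinder()
    finder.feed(markup)
    for onclick in finder.onclicks:
        match = re.search(r"showFullNumber\(this,\s*'([\d ]+)'", onclick)
        if match:
            return match.group(1)
    return None


# Shortened version of the markup from the question.
sample = """
<h3 id="resultTelBar">
    <span onclick="showFullNumber(this, '555 1437662');
        dcsMultiTrack('DCSext.linkvalue','555 1437662')"
        >086 143 ....</span>
</h3>
"""
print(phone_from_html(sample))
```

Keeping the attribute lookup separate from the regex means that only the pattern needs changing if the javascript call is reformatted later.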
