Extract phone number from html in Python

The phone number is hidden (shown as '555 143 ....') until the user clicks it, which reveals '555 1437662'. The full number is already present in the onclick attribute, though. What options do I have for extracting the phone number from the HTML below?

<html>
    <body>
        <h3 id="resultTelBar">
            <span onclick="showFullNumber(this, '555 1437662');
                dcsMultiTrack('DCSext._mainreq','','DCSext.linktype',
                'telephone show','DCSext.linkplace','','DCSext.linkvalue','555 1437662',
                'DCSext.show_listingId','SA_6597739_4638_003722_8396251_IYMX',
                'DCSext.show_zoningUsed','0','DCSext.show_resultNumber','1')"
                >086 143 ....</span>
        </h3>
    </body>
</html>

I noticed the beautifulsoup tag, but here is a variant using lxml instead; use it if you like. I did not put much care into the regular expression, so improve it if it fails in some cases.

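>>> import re
>>> from lxml import etree
>>> # Parse the markup and keep the root element (the original discarded it).
>>> html = etree.fromstring(u'''YOUR HTML''')
>>> # Select the onclick attribute of the span inside <h3 id="resultTelBar">.
>>> onclick = html.xpath('//h3[@id="resultTelBar"]/span/@onclick')[0]
>>> # Capture the digits and spaces passed as the second argument.
>>> print(re.search(r"showFullNumber\(this,\s*'([\d ]+)'", onclick).group(1))
555 1437662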

The information is embedded in a script that's included as a string in a tag attribute? That's... very unfortunate.

(Edit: To clarify, I'm assuming the question here is "given this unfortunate html/javascript as input, how can I parse out the phone number with BeautifulSoup". Please advise if this is incorrect.)

I suppose the easiest thing is to isolate that javascript string and then use a regex to extract the number. However, the regex part will be a PITA and fairly fragile.

soup.find('h3', id='resultTelBar').span['onclick'] will get you the string, assuming soup is the BeautifulSoup object. Then use re.search to parse the number out of the first line. The exact regex depends on how regular the input is (is every JavaScript string formatted this way, including the line breaks?) and on how robust it needs to be, for instance against foreign telephone numbers or against small changes to the JavaScript in future versions of this data.
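
The approach above could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the helper names phone_from_onclick and phone_from_html are made up for this example, and the pattern assumes the number is always the single-quoted second argument of showFullNumber. It tolerates extra whitespace and line breaks and captures whatever sits between the quotes, so a number containing "+" or "-" is not rejected.

```python
import re

# Assumption: the number is the single-quoted second argument of
# showFullNumber(this, '...'). \s* also matches line breaks.
SHOW_FULL_NUMBER = re.compile(r"showFullNumber\(\s*this\s*,\s*'([^']*)'")


def phone_from_onclick(onclick):
    """Return the quoted argument of showFullNumber, or None if absent."""
    match = SHOW_FULL_NUMBER.search(onclick)
    return match.group(1).strip() if match else None


def phone_from_html(markup):
    """Locate the span under <h3 id="resultTelBar"> and parse its onclick."""
    # Imported here so the regex helper above works without bs4 installed.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(markup, "html.parser")
    heading = soup.find("h3", id="resultTelBar")
    if heading is None or heading.span is None:
        return None
    return phone_from_onclick(heading.span.get("onclick", ""))


sample_onclick = (
    "showFullNumber(this, '555 1437662');\n"
    "    dcsMultiTrack('DCSext._mainreq','')"
)
print(phone_from_onclick(sample_onclick))  # prints: 555 1437662
```

Returning None instead of raising keeps the caller in control when a listing has no phone span or the JavaScript changes shape. Whether that is preferable to a loud failure depends on how the scraped data is used.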
