
How do I use the PHP Simple HTML DOM Parser to parse this?

Here is an example of the HTML I need to parse into a PHP program:

                    <div id="dump-list">    
<div class="dump-row"> 
 <div class="dump-location odd" data-jmapping="{id: 35, point: {lng: -73.00898601, lat: 41.71727402}, category: 'office'}">

    <div class="SingleLinkNoTx">
    <a href="#10" class="loc-link">Acme Software</a><br/><strong>John Doe, MBA</strong><br/>123 Main St.<br />New York, NY 10036<br /><strong class="telephone">(212) 555-1234</strong><br/>
    </div><!-- END.SingleLinkNoTx -->

    <a href="http://www.example.com" target="_blank" class="web_link">Visit Website</a><span><br />(0.3 miles)</span>   
    <div class="loc-info">
            <div class="loc-info-text ">
        John Doe, MBA<br /><a href="http://maps.google.com/?daddr=41.71727402,-73.00898601" target="_blank">Get Directions &raquo;</a>    
        </div>

    </div>

</div>

This is the information I want to extract from the above HTML example into PHP:

lng: -73.00898601, lat: 41.71727402
category: 'office'
Acme Software
John Doe, MBA
123 Main St.
New York, NY 10036
(212) 555-1234
http://www.example.com

I have tried using PHP Simple HTML DOM Parser, but I'm new to it and can't find a working PHP example that fits what I need to do. I tried the code below to understand how it works, but var_dump($e) produces a huge amount of output, including messages about recursion, so I'm lost as to how to actually use it. Any help would be greatly appreciated!

$e = $html->find('.dump-location', 0)->find('.SingleLinkNoTx', 0);
echo $e;
var_dump($e);

You can use XPath to find and extract elements in an HTML/XML document, specifically via the SimpleXMLElement::xpath method.

The following example will find the telephone number for a location:

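// Collect libxml parse warnings instead of printing them; real-world HTML is rarely valid.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
// Replace the placeholder with your HTML snippet, or use loadHTMLFile() instead.
$doc->loadHTML('your html snippet goes here - or use loadHTMLFile()');
// Convert to SimpleXML so that SimpleXMLElement::xpath() is available.
$xml = simplexml_import_dom($doc);
$elements = $xml->xpath('//*[contains(@class, "dump-location")]/div[@class="SingleLinkNoTx"]/strong[@class="telephone"]');
print_r($elements);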

The most complex part is the XPath expression. A quick breakdown:

  1. //
    • Selects matching elements at any depth in the document, not only at the root.
  2. *[contains(@class, "dump-location")]
    • Matches any element whose class attribute contains the string dump-location (so class="dump-location odd" matches).
  3. /
    • Restricts the next step to direct children of the elements matched so far.
  4. div[@class="SingleLinkNoTx"]
    • Matches any DIV element whose class attribute is exactly SingleLinkNoTx (no other class names).
  5. /strong[@class="telephone"]
    • Matches STRONG elements that are direct children of that DIV and whose class attribute is exactly telephone.
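The difference between the exact match in step 4 and the contains() match in step 2 can be seen in a small sketch. The markup here is made up for illustration (it is not from the question), and it uses DOMXPath only because its query() result exposes a length property. The third expression is a common XPath 1.0 idiom for matching a whole class token, which avoids false hits on names such as dump-location-extra.

```php
<?php
// Hypothetical markup: one element with two classes, one with a look-alike class.
$doc = new DOMDocument();
$doc->loadHTML('<div><p class="dump-location odd">a</p><p class="dump-location-extra">b</p></div>');
$xpath = new DOMXPath($doc);

// Exact comparison against the whole attribute value: "dump-location odd" does not match.
$exact = $xpath->query('//*[@class="dump-location"]')->length;

// Substring test: matches both elements, including the look-alike class.
$contains = $xpath->query('//*[contains(@class, "dump-location")]')->length;

// Whole-token test: pad the attribute with spaces and look for " dump-location ".
$token = $xpath->query(
    '//*[contains(concat(" ", normalize-space(@class), " "), " dump-location ")]'
)->length;

echo "exact=$exact contains=$contains token=$token\n";
```

For the HTML in the question, the plain contains() form is sufficient, since no other class name there includes dump-location as a substring.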

Running this XPath expression against the HTML snippet provided in the question produces output like the following, which is fairly easy to iterate over and extract information from:

Array
(
    [0] => SimpleXMLElement Object
        (
            [@attributes] => Array
                (
                    [class] => telephone
                )

            [0] => (212) 555-1234
        )

)
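To get plain strings out of a result like this, each SimpleXMLElement can be cast to a string. The following is a minimal sketch that assumes a trimmed-down version of the question's markup; the $phones variable name is only for illustration.

```php
<?php
// Trimmed-down markup based on the question; only the parts the expression needs.
$snippet = '<div class="dump-location odd"><div class="SingleLinkNoTx">'
    . '<strong>John Doe, MBA</strong><br/>'
    . '<strong class="telephone">(212) 555-1234</strong>'
    . '</div></div>';

libxml_use_internal_errors(true);   // keep parse warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($snippet);
$xml = simplexml_import_dom($doc);

$elements = $xml->xpath('//*[contains(@class, "dump-location")]/div[@class="SingleLinkNoTx"]/strong[@class="telephone"]');

$phones = [];
foreach ($elements as $element) {
    // Casting a SimpleXMLElement to string yields its text content.
    $phones[] = trim((string) $element);
}
print_r($phones);
```

Attributes can be read the same way, for example (string) $element['class'].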

If you know the document structure, it's possible to construct an XPath expression for each piece of information you want to extract. Alternatively, it might be simpler to use a more general XPath expression (say, one that retrieves all dump-location elements) and manually iterate through the elements.
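As a rough sketch of the second approach, the code below selects every dump-location element and then pulls each field out relative to it. It uses DOMXPath rather than SimpleXMLElement::xpath, because DOMXPath::evaluate() can return strings directly. The markup is trimmed from the question, and the regular expressions and array keys are assumptions chosen for illustration; the data-jmapping value is not valid JSON, so json_decode() would not parse it as-is.

```php
<?php
// Markup trimmed from the question; in practice, load the real page instead.
$html = <<<'HTML'
<div id="dump-list">
<div class="dump-row">
 <div class="dump-location odd" data-jmapping="{id: 35, point: {lng: -73.00898601, lat: 41.71727402}, category: 'office'}">
    <div class="SingleLinkNoTx">
    <a href="#10" class="loc-link">Acme Software</a><br/><strong>John Doe, MBA</strong><br/>123 Main St.<br />New York, NY 10036<br /><strong class="telephone">(212) 555-1234</strong><br/>
    </div>
    <a href="http://www.example.com" target="_blank" class="web_link">Visit Website</a>
 </div>
</div>
</div>
HTML;

libxml_use_internal_errors(true);   // keep parse warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$locations = [];
foreach ($xpath->query('//div[contains(@class, "dump-location")]') as $loc) {
    $row = [];

    // data-jmapping is a JavaScript-style literal, so pick values out with regexes.
    $mapping = $loc->getAttribute('data-jmapping');
    if (preg_match('/lng:\s*(-?[\d.]+),\s*lat:\s*(-?[\d.]+)/', $mapping, $m)) {
        $row['lng'] = $m[1];
        $row['lat'] = $m[2];
    }
    if (preg_match("/category:\s*'([^']*)'/", $mapping, $m)) {
        $row['category'] = $m[1];
    }

    // Expressions are evaluated relative to the current dump-location element.
    $row['name']    = trim($xpath->evaluate('string(.//a[@class="loc-link"])', $loc));
    $row['contact'] = trim($xpath->evaluate('string(div[@class="SingleLinkNoTx"]/strong[not(@class)])', $loc));
    $row['phone']   = trim($xpath->evaluate('string(.//strong[@class="telephone"])', $loc));
    $row['website'] = $xpath->evaluate('string(.//a[@class="web_link"]/@href)', $loc);

    // Address lines are the non-blank text nodes sitting directly inside SingleLinkNoTx.
    $row['address'] = [];
    foreach ($xpath->query('div[@class="SingleLinkNoTx"]/text()[normalize-space()]', $loc) as $text) {
        $row['address'][] = trim($text->nodeValue);
    }

    $locations[] = $row;
}
print_r($locations);
```

Each entry in $locations then holds the fields listed in the question for one location. If the real pages vary in structure, the relative expressions would need adjusting.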

