
Find an ID inside the class name of an HTML element using PHP

I am new to PHP. I want to write code that finds the ID embedded in the HTML below, which is 1123. Can anyone give me some ideas?

<span class="miniprofile-container /companies/1123?miniprofile="
      data-tracking="NUS_CMPY_FOL-nhre"
      data-li-getjs="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=dyt8o4nwtaujeutlgncuqe0dn&amp;fc=2">
    <strong>
        <a href="http://www.linkedin.com/nus-trk?trkact=viewCompanyProfile&pk=biz-overview-public&pp=1&poster=&uid=5674666402166894592&ut=NUS_UNIU_FOLLOW_CMPY&r=&f=0&url=http%3A%2F%2Fwww%2Elinkedin%2Ecom%2Fcompany%2F1123%3Ftrk%3DNUS_CMPY_FOL-nhre&urlhash=7qbc">
        Bank of America
        </a>
    </strong>
</span> has a new Project Manager

Note: I don't need the content of the span. I need the ID embedded in the span's class attribute.

I tried the following:

$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$xmlElements = simplexml_import_dom($dom);
$id = $xmlElements->xpath("//span [@class='miniprofile-container /companies/$data_id?miniprofile=']");

... but I don't know how to proceed further.

Depending on your needs, you could do:

$matches = array();
preg_match('|<span class="miniprofile-container /companies/(\d+)\?miniprofile|', $html, $matches);
print_r($matches);

This is a very trivial regex, but it could serve as a first suggestion. If you want to go via DOMDocument or SimpleXML, pick one rather than mixing both as you did in your example. Tell us which you prefer and we can narrow this down.
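If the page contains several such spans, the same pattern can be applied with preg_match_all to collect every ID rather than only the first. The following is a minimal sketch under that assumption; the function name findCompanyIds and the two-span sample string are made up for illustration and are not part of the original question.

```php
<?php
// Hypothetical helper (name is illustrative): returns every company id
// found in spans whose class starts with "miniprofile-container /companies/".
function findCompanyIds(string $html): array
{
    $matches = array();
    // Same pattern as the preg_match call above; preg_match_all gathers
    // all occurrences, and $matches[1] holds the captured digit groups.
    preg_match_all(
        '|<span class="miniprofile-container /companies/(\d+)\?miniprofile|',
        $html,
        $matches
    );
    return $matches[1];
}

// Made-up sample input with two spans.
$sample = '<span class="miniprofile-container /companies/1123?miniprofile="></span>'
        . '<span class="miniprofile-container /companies/42?miniprofile="></span>';

print_r(findCompanyIds($sample));
```

Like the single-match version, this relies on class being the first attribute of the span and on the exact spacing shown, so it is likely to break if the markup changes.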

//edit: pretty much what @fireeyedboy said, but this is what I just fiddled together:

<?php
$html = <<<EOD
<html><head></head>
<body>
<span class="miniprofile-container /companies/1123?miniprofile="
      data-tracking="NUS_CMPY_FOL-nhre"
      data-li-getjs="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=dyt8o4nwtaujeutlgncuqe0dn&amp;fc=2">
    <strong>
        <a href="#">
        Bank of America
        </a>
    </strong>
</span> has a new Project Manager

</body>
</html>
EOD;

$domDocument = new DOMDocument('1.0', 'UTF-8');
$domDocument->recover = TRUE;
$domDocument->loadHTML($html);

$xPath = new DOMXPath($domDocument);
$relevantElements = $xPath->query('//span[contains(@class, "miniprofile-container")]');
$foundId = NULL;
foreach($relevantElements as $match) {
    $pregMatches = array();
    if (preg_match('|/companies/(\d+)\?miniprofile|', $match->getAttribute('class'), $pregMatches)) {
        if (isset($pregMatches[1])) {
            $foundId = $pregMatches[1];
            break;
        }
    }
}

echo $foundId;

?>
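The full markup in the question has unescaped ampersands in the link's href, which makes loadHTML emit warnings. One way to handle that without the @ operator is libxml_use_internal_errors(). The sketch below wraps the same DOMXPath lookup in a function; the name extractCompanyId, the shortened sample markup, and the nullable return type (which needs PHP 7.1 or later) are assumptions made for illustration.

```php
<?php
// Hypothetical helper (name is illustrative): returns the first company id
// found in a "miniprofile-container" span, or null if there is none.
function extractCompanyId(string $html): ?string
{
    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xPath = new DOMXPath($doc);
    $spans = $xPath->query('//span[contains(@class, "miniprofile-container")]');

    foreach ($spans as $span) {
        $m = array();
        if (preg_match('|/companies/(\d+)\?miniprofile|', $span->getAttribute('class'), $m)) {
            return $m[1];
        }
    }

    return null;
}

// Shortened, made-up sample; the unescaped "&" in href mimics the original markup.
$html = '<span class="miniprofile-container /companies/1123?miniprofile=">'
      . '<a href="x?a=1&b=2">Bank of America</a></span>';

var_dump(extractCompanyId($html));
```

Returning null lets the caller tell "no ID found" apart from a found ID.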

This should do what you are after:

$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML( $html );
$xpath = new DOMXPath( $dom );

/*
 * The following XPath query selects the class attribute of every span element
 * whose class attribute contains the tokens " miniprofile-container " and " /companies/".
 */
$nodes = $xpath->query( "//span[contains(concat(' ', @class, ' '), ' miniprofile-container ') and contains(concat(' ', @class, ' '), ' /companies/')]/@class" );
foreach( $nodes as $node )
{
    // extract the number found between "/companies/" and "?miniprofile" in the node's nodeValue
    // (only read $matches[ 1 ] when the pattern actually matched)
    if( preg_match( '#/companies/(\d+)\?miniprofile#', $node->nodeValue, $matches ) )
    {
        var_dump( $matches[ 1 ] );
    }
}
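As a variation, the regex step can be replaced by XPath 1.0 string functions, using DOMXPath::evaluate() instead of query(). This is only a sketch: it assumes the ID is always followed by a "?" in the class value and reads only the first matching span. The shortened sample markup is made up for illustration.

```php
<?php
// Shortened, made-up sample markup.
$html = '<html><body>'
      . '<span class="miniprofile-container /companies/1123?miniprofile=">Bank of America</span>'
      . '</body></html>';

$dom = new DOMDocument('1.0', 'UTF-8');
@$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// substring-after() keeps what follows "/companies/" in the first matching
// class attribute; substring-before() then cuts it off at the "?".
// evaluate() returns a plain string here, or "" when nothing matches.
$id = $xpath->evaluate(
    'string(substring-before(substring-after('
    . '//span[contains(@class, "/companies/")]/@class, "/companies/"), "?"))'
);

var_dump($id);
```

Unlike the regex version, this does not check that the extracted value is numeric, so validate it (for example with ctype_digit()) before using it.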


 