
HTML parsing using Simple HTML DOM Parser

I am using Simple HTML DOM Parser to parse some HTML.

I have HTML like this:

<span class="UIStory_Message">
    Yeah, elixir of life!<br/>
   <a href="asdfasdf">
      <span>asdfsdfasdfsdf</span>
       <wbr/>
       <span class="word_break"/>
       61193133389&ref=nf
   </a>
</span>

My code is:

$storyMessageNodes    = $story->find('span.UIStory_Message');
$storyMessage         = strip_tags($storyMessageNodes[0]->innertext);

I want to get only the text directly inside the span "UIStory_Message", i.e. "Yeah, elixir of life!".

But the above code gives me all the text inside the whole span, i.e. "Yeah, elixir of life! asdfsdfasdfsdf 61193133389&ref=nf ".

How can I write the code so that it gives only "Yeah, elixir of life!"?

I've written a method to get rid of unneeded elements in fetched DOM nodes. I've contacted the author, but Simple HTML DOM has not been active for two years, so I doubt he will include it in the distribution. Here it is:

/**
 * remove specified nodes from selected dom
 *
 * @param string $selector
 * @param int|array $limit (optional) possible values include:
 *   + positive integer - remove first denoted number of elements
 *   + negative integer - remove last denoted number of elements
 *   + array of ones and zeroes - remove the respective matches that equal to one
 *
 * eg.
 *   // will remove first two images found in node
 *   $dom->removeNodes('img',2);
 *
 *   // will remove last two images found in node
 *   $dom->removeNodes('img',-2);
 *
 *   // will remove all but the third image found in node
 *   $dom->removeNodes('img',array(1,1,0,1));
 *
 * [!!!] if there are more matches found than elements in array, the last array member will be used for processing
 *
 * eg.
 *   // will remove second and every following image
 *   $dom->removeNodes('img',array(0,1));
 *
 *   // will remove only the second image
 *   $dom->removeNodes('img',array(0,1,0));
 *
 * @return simple_html_dom_node
 */
public function removeNodes($selector, $limit = NULL)
{
    $elements = $this->find($selector);
    if ( empty($elements) ) return $this;


    if ( isset($limit) && is_int( $limit ) && $limit < 0 ) {
        $limit = abs( $limit );
        $elements = array_reverse( $elements );
    }

    foreach ( $elements as $element ) {

        if ( isset($limit) ) {

            if ( is_array( $limit ) ) {
                $current = current( $limit );
                if ( next( $limit ) === FALSE ) {
                    end( $limit );
                }
                if ( !$current ) {
                    continue;
                }
            } else {
                if ( --$limit === -1 ) {
                    return $this;
                }
            }
        }

        $element->outertext = '';

    }

    return $this;
}

Put it in the simple_html_dom_node class or one extending it. In the asker's case you'd use it like this:

$storyMessageNodes = $story->find('span.UIStory_Message');
$storyMessage = $storyMessageNodes[0]->removeNodes('a')->plaintext;
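To make the $limit rules from the doc comment easier to check, here is a minimal standalone sketch that mirrors the same loop on plain positions instead of DOM nodes. The function name positionsToRemove() and the $count parameter are made up for illustration; they are not part of Simple HTML DOM.

```php
<?php
// Hypothetical helper (not part of Simple HTML DOM): returns the zero-based
// positions of the matches that removeNodes() above would blank out,
// given the number of matches and the same $limit argument.
function positionsToRemove(int $count, $limit = null): array
{
    $positions = $count > 0 ? range(0, $count - 1) : [];

    // A negative integer means "count from the end".
    if (is_int($limit) && $limit < 0) {
        $limit = abs($limit);
        $positions = array_reverse($positions);
    }

    $removed = [];
    foreach ($positions as $pos) {
        if (isset($limit)) {
            if (is_array($limit)) {
                $flag = current($limit);
                // Past the end of the array: stay on the last member.
                if (next($limit) === false) {
                    end($limit);
                }
                if (!$flag) {
                    continue;
                }
            } elseif (--$limit === -1) {
                break; // integer budget used up
            }
        }
        $removed[] = $pos;
    }

    sort($removed);
    return $removed;
}

print_r(positionsToRemove(4, array(0, 1)));
```

With four matches, array(0,1) selects positions 1, 2 and 3, while array(0,1,0) selects only position 1, which matches the examples in the doc comment.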

You can do something like this:

$result = $story->find('span.UIStory_Message');

Then use substr() to cut the element's inner HTML at the first "<"; another option is to write a simple regular expression.
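A rough sketch of both ideas on a plain string, assuming you already have the span's inner HTML (for example from innertext). The helper names textBeforeFirstTag() and textBeforeFirstTagRegex() are invented for this example.

```php
<?php
// Hypothetical helper: everything before the first "<", trimmed.
function textBeforeFirstTag(string $innerHtml): string
{
    $pos = strpos($innerHtml, '<');
    $text = ($pos === false) ? $innerHtml : substr($innerHtml, 0, $pos);
    return trim($text);
}

// Same idea with a regular expression: match the leading run of non-"<" characters.
function textBeforeFirstTagRegex(string $innerHtml): string
{
    preg_match('/^[^<]*/', $innerHtml, $matches);
    return trim($matches[0]);
}

// Inner HTML of the span from the question.
$inner = "\n    Yeah, elixir of life!<br/>\n   <a href=\"asdfasdf\">\n"
       . "      <span>asdfsdfasdfsdf</span>\n       61193133389&ref=nf\n   </a>\n";

echo textBeforeFirstTag($inner), "\n";
echo textBeforeFirstTagRegex($inner), "\n";
```

This only works when the text you want comes before any child tag; text that follows a nested element would be lost.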


I haven't tested this; it's just a guess based on the documentation. Try:

$story->find('span.UIStory_Message', 0)->plaintext; // same result as strip_tags()?

Or:

$story->find('span.UIStory_Message', 0)->find('text');

If that doesn't work, try playing with these options.
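For comparison, the "text nodes only" idea can be sketched with PHP's bundled DOM extension instead of Simple HTML DOM. This is a different library, swapped in only because it ships with PHP; the helper name directText() is invented, and the class name is interpolated into the XPath without escaping, which is fine for a sketch but not for untrusted input.

```php
<?php
// Hypothetical helper: concatenates only the text nodes that are direct
// children of the first-level match for span.<class>, ignoring nested tags.
function directText(string $html, string $class): string
{
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    // text() selects text-node children only, not text inside <a> or nested <span>.
    $nodes = $xpath->query('//span[@class="' . $class . '"]/text()');

    $text = '';
    foreach ($nodes as $node) {
        $text .= $node->nodeValue;
    }
    return trim($text);
}

// Simplified version of the markup from the question.
$html = '<span class="UIStory_Message">Yeah, elixir of life!<br/>'
      . '<a href="asdfasdf"><span>asdfsdfasdfsdf</span>61193133389&amp;ref=nf</a>'
      . '</span>';

echo directText($html, 'UIStory_Message'), "\n";
```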

When you only clear the outer text, you delete the HTML content itself, but if you perform another find on the same elements, they will still appear in the result. The reason is that the Simple HTML DOM object still keeps its internal structure for the element, only without its actual content. To really delete the element, reload the HTML as a string into the same variable. That way the Simple HTML DOM object is rebuilt without the deleted content.

Here is an example function:

public function removeNode($selector)
{
    foreach ($this->find($selector) as $node)
    {
        $node->outertext = '';
    }

    $this->load($this->save());        
}

Put this function inside the simple_html_dom class and you're good.
