
How can I scrape invalid HTML using PHP Simple HTML DOM?

I'm trying to scrape a webpage using PHP Simple HTML DOM.

$html = str_get_html('<div class="namepageheader"> 
            <div class="u">Name: <a href="someurl">Noor Shaad</a>
            <div class="u">Age: </div>
        </div> ');
$name = $html->find('div[class="u"]', 0)->innertext;
$age = $html->find('div[class="u"]', 1)->innertext;

I tried to get the text from each class="u" element, but it doesn't work because the first <div class="u"> is missing its closing </div> tag. Can anyone help me with this?

You can find an element close to where the missing tag should have been and normalize the HTML with a string replacement before parsing it. For example, you can replace the </a> tag with </a></div>.

$html = str_replace('</a>', '</a></div>', $html);

Or, if the page contains other </a> tags that should not be touched, replace the more specific sequence </a><div class="u"> with </a></div><div class="u">:

$html = str_replace('</a><div class="u">', '</a></div><div class="u">', $html);
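Note that str_replace returns the modified string rather than changing its argument, and its optional fourth parameter reports how many replacements were made. The following is a minimal sketch using only built-in PHP functions; the variable names ($compact, $spaced, $count, $countSpaced) are illustrative and the sample strings are shortened versions of the markup in the question.

```php
<?php
$search  = '</a><div class="u">';
$replace = '</a></div><div class="u">';

// Markup with no whitespace between the tags: the search string matches.
$compact = '<div class="u">Name: <a href="someurl">Noor Shaad</a><div class="u">Age: </div>';
$fixed = str_replace($search, $replace, $compact, $count);

// Same markup with a line break between the tags: nothing matches.
$spaced = "<div class=\"u\">Name: <a href=\"someurl\">Noor Shaad</a>\n    <div class=\"u\">Age: </div>";
$unchanged = str_replace($search, $replace, $spaced, $countSpaced);

echo $count, "\n";        // number of replacements in the compact string
echo $countSpaced, "\n";  // number of replacements in the spaced string
```

Checking the count is a cheap way to notice that a repair silently did nothing.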

There may be another problem: whitespace between the tags prevents the search string from matching, so the replacement does nothing. To solve this, first remove the whitespace between the tags and then do the replacement.

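$html = '<div class="namepageheader"> 
            <div class="u">Name: <a href="someurl">Noor Shaad</a>
            <div class="u">Age: </div>
        </div> ';
// Collapse whitespace between adjacent tags so the search string can match.
$html = preg_replace('~>\s+<~', '><', $html);
// Insert the missing </div>; str_replace returns the new string, so assign it.
$html = str_replace('</a><div class="u">', '</a></div><div class="u">', $html);
// Parse the repaired markup (str_get_html comes from simple_html_dom.php).
$dom = str_get_html($html);
$name = $dom->find('div[class="u"]', 0)->innertext;
$age = $dom->find('div[class="u"]', 1)->innertext;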
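As a variation, the two steps can be folded into one whitespace-tolerant regular expression, which avoids stripping whitespace everywhere else in the page. The sketch below is only an illustration: repairUnclosedDiv is a hypothetical helper name, and the result is checked with PHP's built-in DOMDocument and DOMXPath instead of Simple HTML DOM so that the example runs without the third-party library (it assumes the dom extension is enabled).

```php
<?php
// Hypothetical helper: insert the missing </div> between </a> and the next
// <div class="u">, allowing any amount of whitespace between the two tags.
function repairUnclosedDiv(string $html): string
{
    return preg_replace('~</a>\s*<div class="u">~', '</a></div><div class="u">', $html);
}

$raw = '<div class="namepageheader">
            <div class="u">Name: <a href="someurl">Noor Shaad</a>
            <div class="u">Age: </div>
        </div>';

$fixed = repairUnclosedDiv($raw);

// Check the repaired markup with the built-in DOM extension
// (a stand-in for Simple HTML DOM, used here only for verification).
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc = new DOMDocument();
$doc->loadHTML($fixed);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@class="u"]');

$name = trim($nodes->item(0)->textContent);
$age  = trim($nodes->item(1)->textContent);

echo $name, "\n";  // Name: Noor Shaad
echo $age, "\n";   // Age:
```

The same repaired string can be passed to str_get_html if you prefer to keep using Simple HTML DOM for the extraction.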

