How can I scrape invalid HTML using PHP Simple HTML DOM?

Question

I'm trying to scrape a webpage using PHP Simple HTML DOM.

// Parse the markup first; find() is a method of the parsed object, not of a string.
$html = str_get_html('<div class="namepageheader">
            <div class="u">Name: <a href="someurl">Noor Shaad</a>
            <div class="u">Age: </div>
        </div>');
$name = $html->find('div[class="u"]', 0)->innertext;
$age = $html->find('div[class="u"]', 1)->innertext;

I tried to get the text from each class="u" element, but it didn't work because the first <div class="u"> is missing its closing </div> tag. Can anyone help me with this?

Answer 1

You can find an element close to where the tag should have been closed and repair the HTML with a string replacement before parsing it. For example, you can replace the </a> tag with </a></div>. Note that str_replace returns the new string, so assign the result:

$html = str_replace('</a>', '</a></div>', $html);

or, if the page contains other </a> tags that should not be followed by a </div>, replace only </a><div class="u"> with </a></div><div class="u">:

$html = str_replace('</a><div class="u">', '</a></div><div class="u">', $html);
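To see why the narrower search string can matter, here is a small sketch that runs both replacements on a compact sample. The sample markup (with a second link inside a correctly closed div) is made up for illustration and is not from the question.

```php
<?php
// Made-up sample: the first div is unclosed, the second one is closed correctly.
$sample = '<div class="u">Name: <a href="x">Noor Shaad</a>'
        . '<div class="u">Site: <a href="y">home</a></div>';

// Broad replacement: every </a> gets a </div>, including the second link.
$broad = str_replace('</a>', '</a></div>', $sample);

// Narrow replacement: only the </a> directly before the next <div class="u">.
$narrow = str_replace('</a><div class="u">', '</a></div><div class="u">', $sample);

// Compare how many closing tags each variant produces for the two opening divs.
echo substr_count($broad, '</div>'), PHP_EOL;
echo substr_count($narrow, '</div>'), PHP_EOL;
```

The broad version leaves one closing tag too many, while the narrow version ends up with exactly one </div> per opening div.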

There may be another problem: if there is whitespace between the tags, the search string does not match and nothing is replaced. To solve this, first remove the whitespace between the tags and then do the replacement.

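include 'simple_html_dom.php'; // assumption: the library file sits next to this script

$markup = '<div class="namepageheader"> 
            <div class="u">Name: <a href="someurl">Noor Shaad</a>
            <div class="u">Age: </div>
        </div> ' ;
// Remove whitespace between tags so the search string below can match.
$markup = preg_replace('~>\\s+<~m', '><', $markup);
// Insert the missing </div>; str_replace returns the result, so assign it.
$markup = str_replace('</a><div class="u">', '</a></div><div class="u">', $markup);
// Parse the repaired markup before calling find().
$html = str_get_html($markup);
$name = $html->find('div[class="u"]', 0)->innertext;
$age = $html->find('div[class="u"]', 1)->innertext;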
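As a variation on the two steps above, the whitespace handling and the replacement could be folded into a single preg_replace that tolerates whitespace between the tags. This is only a sketch: the function name closeUnterminatedDivs is hypothetical, and it assumes the unclosed div always ends with </a> right before the next <div class="u">, as in the question's markup.

```php
<?php
// Hypothetical helper: adds the missing </div> after an </a> that is
// followed (possibly after whitespace) by the next <div class="u">.
function closeUnterminatedDivs(string $html): string
{
    // \s* consumes any whitespace between the tags; the lookahead keeps
    // the following <div class="u"> in place.
    return preg_replace('~</a>\s*(?=<div class="u">)~', '</a></div>', $html);
}

$broken = '<div class="namepageheader">
    <div class="u">Name: <a href="someurl">Noor Shaad</a>
    <div class="u">Age: </div>
</div>';

$fixed = closeUnterminatedDivs($broken);
echo $fixed, PHP_EOL;
```

The repaired string can then be passed to str_get_html as before. Unlike stripping all whitespace between tags, this leaves the rest of the markup untouched.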
