HTML parsing with PHP simple_html_dom

I'm parsing the columnist pages of online newspapers. I have a problem with this site:

http://www.sozcu.com.tr/kategori/yazarlar/

The parsing worked fine at first, but then it stopped working.

Here's my code

$curl_handle=curl_init();
curl_setopt($curl_handle, CURLOPT_URL,$gazeteAdress);
//curl_setopt($curl_handle, CURLOPT_CONNECTTIMEOUT, 2);
curl_setopt($curl_handle, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'mozilla');
$query = curl_exec($curl_handle);
curl_close($curl_handle);
$html = new simple_html_dom();
$html->load($query);

I don't know why my code sometimes fails to parse the site. I first suspected the connection timeout, but that is not the problem, so I tried printing the HTML page fetched with cURL instead:

echo $html;

Here is the result (sometimes my code does not parse the HTML page properly): [image]

Why are the HTML tags missing, and why am I seeing output like this? Can anyone help?

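The content is returned compressed, so you should tell cURL to send an Accept-Encoding header of 'gzip,deflate'. Setting CURLOPT_ENCODING does this and also makes cURL decompress the response before returning it.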
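To check whether compression really is the cause, one option is to look at the first two bytes of the body that curl_exec returned. The sketch below assumes the server used gzip (magic bytes 0x1f 0x8b) and that the zlib extension is loaded; looksGzipped and decodeIfGzipped are hypothetical helper names, not part of cURL or simple_html_dom.

```php
<?php
// Hypothetical helper: true when $body starts with the gzip magic bytes 0x1f 0x8b.
function looksGzipped(string $body): bool
{
    return strncmp($body, "\x1f\x8b", 2) === 0;
}

// Hypothetical helper: return the decompressed body when it is gzip data,
// otherwise return it unchanged.
function decodeIfGzipped(string $body): string
{
    if (!looksGzipped($body)) {
        return $body;
    }
    $decoded = gzdecode($body); // gzdecode() comes from the zlib extension
    return $decoded === false ? $body : $decoded;
}
```

If looksGzipped($query) returns true for the failing requests, the garbled output is compressed data rather than broken HTML. This is only a diagnostic; letting cURL handle the decoding is the simpler fix.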

Please add this line
curl_setopt($curl_handle, CURLOPT_ENCODING, "gzip,deflate");
after this
curl_setopt($curl_handle, CURLOPT_USERAGENT, 'mozilla');
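Put together, the request from the question with this option added might look like the sketch below. fetchHtml is a hypothetical wrapper name, and the 10-second connect timeout and the exception on failure are assumptions added for illustration; only the CURLOPT_ENCODING line is the actual fix.

```php
<?php
// Hypothetical wrapper around the cURL calls from the question.
function fetchHtml(string $url, int $timeoutSeconds = 10): string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, 'mozilla');
    // Send "Accept-Encoding: gzip,deflate" and let cURL decompress the reply.
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip,deflate');
    // Assumed value, not from the original post.
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $timeoutSeconds);

    $body = curl_exec($ch);
    if ($body === false) {
        // Surface the cURL error instead of passing false on to the parser.
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }
    return $body; // the handle is released when $ch goes out of scope
}
```

The returned string can then be passed to $html->load() as in the question, for example $html->load(fetchHtml($gazeteAdress)).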

Add this at the top of your PHP script:

header('Content-Type: text/html; charset=utf-8');
