
&nbsp removal in PHP

I need to remove all dodgy HTML characters from a website I'm parsing using cURL and Simple HTML DOM.

<?php
$html = "this is&nbsp;a text";
var_dump($html);
var_dump(html_entity_decode($html,ENT_COMPAT,"UTF-8"));

Which outputs:

string(19) "this is&nbsp;a text"

string(15) "this is┬áa text"

I don't want to use preg* as there are other characters in the text (e.g. &deg). This is driving me insane now!

Thanks, James

You need to specify your output encoding with a header:

<?php
    header('Content-Type: text/html; charset=utf-8');

    $html = "this is&nbsp;a text";
    var_dump($html);
    var_dump(html_entity_decode($html,ENT_COMPAT,"UTF-8"));
?>

The browser does not assume UTF-8 by default, which is why it displays the wrong character.
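
To check that html_entity_decode() is producing the right bytes and that only the display is off, it may help to look at the raw bytes instead of the rendered text. This is a minimal sketch, assuming the same sample string as the question; the variable names are illustrative only.

<?php
$html    = "this is&nbsp;a text";

// Decode the entity to UTF-8; &nbsp; becomes U+00A0 (bytes 0xC2 0xA0).
$decoded = html_entity_decode($html, ENT_COMPAT, "UTF-8");

// Dump the bytes as hex so the result does not depend on how the
// browser or console interprets the output encoding.
$hex = bin2hex($decoded);
echo $hex, PHP_EOL;

// strlen() counts bytes, which is why var_dump() reports string(15).
echo strlen($decoded), PHP_EOL;

If the hex contains c2a0 between the two words, the decode worked, and the "┬á" is most likely just those two bytes being shown in a single-byte code page.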

If that's the only character that needs replacing, just use str_replace():

var_dump(str_replace('&nbsp;', ' ', "this is&nbsp;a text"));

See it in action
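
If other entities such as &deg; also need decoding, one option that still avoids preg* is to decode everything first and then swap the decoded no-break space for a regular space. This is a sketch only, assuming UTF-8 text; the sample string is hypothetical and not from the question.

<?php
$html = "20&deg;C is&nbsp;warm";

// Decode every named entity to UTF-8 (&deg; becomes °, &nbsp; becomes U+00A0).
$decoded = html_entity_decode($html, ENT_COMPAT, "UTF-8");

// Replace the UTF-8 no-break space (bytes 0xC2 0xA0) with a plain space.
// Other decoded characters, such as the degree sign, are left untouched.
$clean = str_replace("\xC2\xA0", " ", $decoded);

var_dump($clean);

Writing the no-break space as "\xC2\xA0" keeps the snippet independent of the encoding the source file is saved in.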
