
php tidy strange behaviour

I'm using PHP's tidy library to "clean and repair" some HTML coming from user input.

Everything works fine, but I'm running into a problem whose cause I can't figure out. My code looks like this:

$tidy = new tidy();

$tidy_options = array(
    'hide-comments' => true,
    'tidy-mark' => false,
    'indent' => false,
    'new-blocklevel-tags' => 'article,footer,header,hgroup,output,progress,section,video',
    'new-inline-tags' => 'audio,details,time,ruby,rt,rp',
    'drop-empty-paras' => false,
    'doctype' => '<!DOCTYPE HTML>',
    'sort-attributes' => 'none',
    'vertical-space' => false,
    'output-xhtml' => true,
    'wrap' => 180,
    'wrap-attributes' => false,
    'break-before-br' => false,
    'show-body-only' => true
);
$data = $tidy->repairString($data, $tidy_options, 'UTF8');
echo $data;

This works for all kinds of input, except when I try to use HTML for embedding SWF files.
So I try this code:

<object data="http://the_swf_file_url" type="application/x-shockwave-flash" width="853" height="520"> 
    <param name="movie" value="http://the_swf_file_url"> 
</object>

but repairString strips all of it out and returns an empty string.
The strangest thing is that:
- If I enter some text along with the above, so that the input looks like Hello world<object...>...</object>, then it works fine.
- Or if I specify 'show-body-only' => false, it also works fine!

Any clue why this is happening? Thanks in advance.

Edit: I tried pankar's suggestion of setting preserve-entities to true, but had no luck...

The problem is that you are trying to process an HTML fragment.

When you do this, the rest of the document is inferred. If you leave the configuration at its defaults and output a tidy document containing just a piece of text, you will see DOCTYPE, html, head and body tags that you did not give it. Tidy inferred that these tags had to exist.
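To see the inference for yourself, here is a minimal sketch that feeds tidy a bare piece of text and prints the whole document it builds around it. The helper name `tidyFullDocument` is made up for this example, and it assumes the tidy extension is loaded.

```php
<?php
// Hypothetical helper (not part of tidy): returns the complete document
// that tidy produces for a fragment, using the default configuration.
function tidyFullDocument(string $fragment): string
{
    $tidy = new tidy();
    // Empty config array: nothing such as show-body-only hides the output.
    $tidy->parseString($fragment, [], 'utf8');
    $tidy->cleanRepair();

    return tidy_get_output($tidy);
}

// The input is plain text only; the surrounding tags in the output
// are all supplied by tidy.
echo tidyFullDocument('Hello world');
```

The exact markup printed depends on the libtidy version, but the html, head and body elements should be there even though the input contained none of them.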

The problem here is that the HTML specification regarding objects states that:

The OBJECT element may also appear in the content of the HEAD element.

When tidy infers the location of your fragment, it puts the fragment in the first place where it is allowed to occur. For an object element, that means tidy will place it in the head tag.

The reason show-body-only affects your output is that your fragment did not get placed in the body.

However, when you add some text, it forces your snippet into the body tag. This is because raw text is not allowed in the head tag, so the logically inferred location of your fragment is in the body.
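One way to check where tidy actually put a fragment is to look at the inferred head and body nodes separately. The sketch below is only illustrative: `tidyLocate` is a hypothetical helper, it uses tidy's default configuration rather than the options from the question, and the result may differ between libtidy versions.

```php
<?php
// Hypothetical helper: reports which inferred section contains $needle
// after tidy has parsed and repaired $fragment.
function tidyLocate(string $fragment, string $needle): string
{
    $tidy = new tidy();
    $tidy->parseString($fragment, [], 'utf8');
    $tidy->cleanRepair();

    // head() and body() return tidyNode objects (or null); ->value holds
    // the node's markup, including its children.
    $head = $tidy->head();
    $body = $tidy->body();

    if ($head !== null && stripos((string) $head->value, $needle) !== false) {
        return 'head';
    }
    if ($body !== null && stripos((string) $body->value, $needle) !== false) {
        return 'body';
    }
    return 'neither';
}

$object = '<object data="http://the_swf_file_url" type="application/x-shockwave-flash">'
    . '<param name="movie" value="http://the_swf_file_url"></object>';

echo tidyLocate($object, '<object'), "\n";                 // object on its own
echo tidyLocate('Hello world' . $object, '<object'), "\n"; // object after raw text
```

If the explanation above holds for your tidy build, the first call should report the head and the second the body.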

In my opinion, the best option available to you is to inject all of your code fragments into a "template" document, and then parse them out again afterwards. You can probably do this fairly easily with DOMDocument.
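A rough sketch of the template idea follows. Everything specific in it is an assumption made for illustration: the function name `tidyFragmentViaTemplate`, the wrapper id `tidy-fragment-root`, and the choice of a div as the wrapper. It wraps the fragment in a complete document so nothing has to be inferred, repairs the whole thing, and then pulls the wrapper's contents back out with DOMDocument.

```php
<?php
// Hypothetical helper: repairs a fragment by embedding it in a full document.
function tidyFragmentViaTemplate(string $fragment, array $options = []): string
{
    // Explicit head and body, so tidy has no placement decision to make.
    // The meta tag tells DOMDocument::loadHTML to read the markup as UTF-8.
    $template = '<!DOCTYPE html><html><head>'
        . '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '<title>template</title></head><body>'
        . '<div id="tidy-fragment-root">' . $fragment . '</div>'
        . '</body></html>';

    // We need the whole document back, not just the body.
    $options['show-body-only'] = false;

    $tidy = new tidy();
    $repaired = $tidy->repairString($template, $options, 'utf8');
    if (!is_string($repaired)) {
        return '';
    }

    // Parse the repaired document; silence libxml warnings about unknown tags.
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($repaired);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Find the wrapper and serialize its children only.
    $xpath = new DOMXPath($dom);
    $root = $xpath->query('//div[@id="tidy-fragment-root"]')->item(0);
    if ($root === null) {
        return '';
    }

    $html = '';
    foreach ($root->childNodes as $child) {
        $html .= $dom->saveHTML($child);
    }
    return $html;
}

echo tidyFragmentViaTemplate(
    '<object data="http://the_swf_file_url" type="application/x-shockwave-flash">'
    . '<param name="movie" value="http://the_swf_file_url"></object>'
);
```

Two caveats: saveHTML re-serializes the nodes, so formatting choices such as output-xhtml are not preserved in the returned string, and user input containing a stray closing div tag could end up outside the wrapper after repair.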

A second solution would be to prepend a sentinel value to the fragment, which forces it into the body when showing only the body.

I.e.

 ____MY_MAGIC_TOKEN____ <object ...></object> 

Then you can strip the token out again afterwards.
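The sentinel approach could look roughly like this. The function name `tidyFragmentWithSentinel` is invented for the example, and the sketch assumes the token never occurs in real user input.

```php
<?php
// Hypothetical helper: forces the fragment into the body by prepending raw
// text, then removes that text from the repaired output.
function tidyFragmentWithSentinel(string $fragment, array $options = []): string
{
    $sentinel = '____MY_MAGIC_TOKEN____';

    // Raw text is not allowed in the head, so tidy has to open the body here.
    $options['show-body-only'] = true;

    $tidy = new tidy();
    $repaired = $tidy->repairString($sentinel . ' ' . $fragment, $options, 'utf8');
    if (!is_string($repaired)) {
        return '';
    }

    // Remove only the first occurrence, which is the one we added.
    $pos = strpos($repaired, $sentinel);
    if ($pos !== false) {
        $repaired = substr_replace($repaired, '', $pos, strlen($sentinel));
    }
    return trim($repaired);
}

echo tidyFragmentWithSentinel(
    '<object data="http://the_swf_file_url" type="application/x-shockwave-flash">'
    . '<param name="movie" value="http://the_swf_file_url"></object>'
);
```

This is shorter than the template approach, but depending on the options in use tidy may wrap the token in an element of its own, which would leave an empty wrapper behind after stripping.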

Try setting the configuration option preserve-entities to true (it defaults to false).

EDIT

Second (more thorough) thoughts. This is expected behaviour. By setting show-body-only to true, you tell tidy to output only the body part of the processed XHTML document.

This setting effectively ignores everything in the <head> of the document, and tidy treats the <object> element as a child of <head>. You can verify this by simply specifying

$data = "<title>My Site</title>";

The output will again be blank.

Prefixing the <object> tag with text simply tricks tidy into believing that this data has to be handled as part of the body of the page, and thus displayed.

Hope this helps more this time.
