Extracting the body text of an HTML document using PHP

Question

I know it's better to use DOM for this purpose, but let's try to extract the text this way:

<?php


$html=<<<EOD
<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>
EOD;


        preg_match('/<body.*?>/', $html, $matches, PREG_OFFSET_CAPTURE);

        if (empty($matches))
            exit;

        $matched_body_start_tag = $matches[0][0];
        $index_of_body_start_tag = $matches[0][1];

        $index_of_body_end_tag = strpos($html, '</body>');


        $body = substr(
                        $html,
                        $index_of_body_start_tag + strlen($matched_body_start_tag),
                        $index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)
        );

echo $body;

The result can be seen here: http://ideone.com/vH2FZ

As you can see, I am getting more text than expected.

There is something I don't understand. To get the correct length for the substr($string, $start, $length) function, I am using:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

I don't see anything wrong with this formula.

Could somebody kindly suggest where the problem is?

Many thanks to you all.

EDIT:

Thank you all very much. It was just a bug in my brain. After reading your answers, I now understand what the problem is. The length should be either:

  $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag));

Or:

  $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag);
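For comparison, the DOM approach mentioned at the top could look roughly like the sketch below. It assumes PHP's bundled DOM extension is available; the variable names ($doc, $bodyNode, $innerHtml) are made up for illustration and are not part of the code above.

```php
<?php
// Same sample document as in the question, written as a plain string.
$html = "<html>\n<head>\n</head>\n<body>\n<p>Some text</p>\n</body>\n</html>";

$doc = new DOMDocument();
libxml_use_internal_errors(true);   // keep parser warnings out of the output
$doc->loadHTML($html);
libxml_clear_errors();

// First (and only) <body> element of the parsed document.
$bodyNode = $doc->getElementsByTagName('body')->item(0);

// Serialize each child of <body> to rebuild its inner HTML.
$innerHtml = '';
foreach ($bodyNode->childNodes as $child) {
    $innerHtml .= $doc->saveHTML($child);
}
$innerHtml = trim($innerHtml);

// Plain text only, with the markup stripped.
$plainText = trim($bodyNode->textContent);

echo $innerHtml, "\n";
echo $plainText, "\n";
```

No offsets or lengths are computed by hand here, so the kind of arithmetic slip described in this question cannot occur.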

Answer 1

The problem is that your string contains newlines, and by default . in a pattern does not match a newline character. You need to add the s modifier to make . match across multiple lines.
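To see the effect of the s modifier in isolation, here is a minimal sketch. The shortened $html string and the variable names ($withoutS, $withS, $m1, $m2) are only for illustration.

```php
<?php
// A body element whose content spans several lines.
$html = "<body>\n<p>Some text</p>\n</body>";

// Without the s modifier, . cannot match "\n", so the pattern cannot span lines.
$withoutS = preg_match('/<body[^>]*>(.*?)<\/body>/', $html, $m1);

// With the s modifier (PCRE_DOTALL), . also matches newline characters.
$withS = preg_match('/<body[^>]*>(.*?)<\/body>/s', $html, $m2);

var_dump($withoutS);      // int(0): no match
var_dump($withS);         // int(1): match
echo trim($m2[1]), "\n";  // <p>Some text</p>
```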

Here is my solution, I prefer it this way.

<?php

$html=<<<EOD
<html>
<head>
</head>
<body buu="grger"     ga="Gag">
<p>Some text</p>
</body>
</html>
EOD;

    // get anything between <body> and </body> where <body can="have_as many" attributes="as required">
    if (preg_match('/(?:<body[^>]*>)(.*)<\/body>/isU', $html, $matches)) {
        $body = $matches[1];
    }
    // output all matches for debugging purposes
    var_dump($matches);
?>

Edit: I am updating my answer to give you a better explanation of why your code fails.

You have this string:

<html>
<head>
</head>
<body>
<p>Some text</p>
</body>
</html>

Everything seems fine with it, but each line actually ends with non-printable newline characters. You have 55 printable characters plus 6 line breaks, and with Windows-style \r\n line endings each line break counts as 2 characters. (The positions below assume \r\n; with plain \n they become 22 and 46, but the reasoning is the same.)

When you reach this part of the code:

$index_of_body_end_tag = strpos($html, '</body>');

You get the correct position of </body> (starting at position 51), and this count includes the newline characters.

So when you reach this line of code:

$index_of_body_start_tag + strlen($matched_body_start_tag)

It evaluates to 25 + 6 = 31 (newlines included), and:

$index_of_body_end_tag - $index_of_body_start_tag + strlen($matched_body_start_tag)

It evaluates to 51 - 25 + 6 = 32 (the number of characters substr will read), but between <body> and </body> there are only 16 printable characters and 4 non-printable ones (the line break after <body> and the line break before </body>), 20 in total. Here is the problem: you have to group the calculation with parentheses, like so:

$index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))

which evaluates to 51 - (25 + 6) = 51 - 31 = 20 (16 + 4).
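The arithmetic can be checked with a short sketch. It writes the line endings explicitly as \r\n, which is an assumption chosen to reproduce the positions 25 and 51 used in this answer; the variable names ($start, $tagLen, $end, $wrongLength, $rightLength) are illustrative only.

```php
<?php
// Same document as in the question, with explicit \r\n line endings (assumed).
$html = "<html>\r\n<head>\r\n</head>\r\n<body>\r\n<p>Some text</p>\r\n</body>\r\n</html>";

$start  = strpos($html, '<body>');    // 25
$tagLen = strlen('<body>');           // 6
$end    = strpos($html, '</body>');   // 51

// The formula from the question: the tag length is added instead of subtracted.
$wrongLength = $end - $start + $tagLen;      // 51 - 25 + 6 = 32

// The grouped formula: subtract the whole offset where the body content begins.
$rightLength = $end - ($start + $tagLen);    // 51 - 31 = 20

echo $wrongLength, "\n";  // 32
echo $rightLength, "\n";  // 20

// The body content, including the line breaks around the paragraph.
$body = substr($html, $start + $tagLen, $rightLength);
var_dump($body === "\r\n<p>Some text</p>\r\n");  // bool(true)
```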

:) Hope this helps you understand why grouping the calculation matters. (Sorry for misleading you about newlines; that point only applies to the regex example I gave above.)

Answer 2

Personally, I wouldn't use regex.

<?php

$html = <<<EOD

<html>
    <head>
        <title>Example</title>
    </head>
    <body>
        <h1>foobar</h1>
    </body>
</html>

EOD;

$s = strpos($html, '<body>') + strlen('<body>');
$f = '</body>';

echo trim(substr($html, $s, strpos($html, $f) - $s));

?>

This returns <h1>foobar</h1>

Answer 3

The problem is in the length you compute for substr. You should subtract all the way:

$index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)

But you are doing:

+ strlen($matched_body_start_tag)

That said, it seems a little overkill considering you can do it using preg_match only. You just need to make sure you match across newlines, using the s modifier:

preg_match('/<body[^>]*>(.*?)<\/body>/s', $html, $matches);
echo $matches[1];

Outputs:

<p>Some text</p>

Answer 4

Somebody has probably already found your error; I didn't read all the replies.
The algebra is wrong.

The corrected code is below.

Btw, this is my first time seeing ideone.com, and it's pretty cool.

$body = substr( 
          $html, 
          $index_of_body_start_tag + strlen($matched_body_start_tag),
          $index_of_body_end_tag - ($index_of_body_start_tag + strlen($matched_body_start_tag))
        );

or:

$body = substr(
          $html,
          $index_of_body_start_tag + strlen($matched_body_start_tag),
          $index_of_body_end_tag - $index_of_body_start_tag - strlen($matched_body_start_tag)
       );
