PHP crawler: reading all data from a page with 2 HTML tags

Question

How can I read all the data with a crawler from a page that has 2 html tags? For example:

<html>
<body>
text text text
</body>
</html>



text2 text2 text2 text
</body>
</html>

I need to replace the first closing html and body tags, and then read all the data. How do I do that?

Answer 1

You can use a regular expression to remove the first occurrence of </body></html>, provided another pair of the same tags follows it:

// https://regex101.com/r/nVuN8S/1
$regex = '/(?<replace><\/body>\s*<\/html>)(?=(?:.|\s)*<\/body>\s*<\/html>)/';
$new_html = preg_replace($regex, '', $html);

Here you look for </body> and </html> separated by any number of whitespace characters (e.g. a newline). A positive lookahead then checks that the pair is followed by any number of characters, including whitespace, and then by another </body> and </html> pair. Because the last pair in the document has nothing after it to satisfy the lookahead, it is left in place.
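As a rough illustration, the replacement can be tried on a small string modelled on the page in the question. The sample string and its exact whitespace are assumptions made for this sketch, not the asker's real page.

```php
<?php
// Hypothetical sample page: two pairs of closing tags, as in the question.
$html = "<html>\n<body>\ntext text text\n</body>\n</html>\n\n"
      . "text2 text2 text2 text\n</body>\n</html>\n";

// Match a closing pair only when another closing pair appears later on.
$regex = '/(?<replace><\/body>\s*<\/html>)(?=(?:.|\s)*<\/body>\s*<\/html>)/';
$new_html = preg_replace($regex, '', $html);

// Only the final closing pair should remain.
echo $new_html;
```

On large pages, `(?:.|\s)*` can be slow; a pattern using `.*` with the `s` modifier (which lets the dot match newlines) is a common alternative, though it has not been tested against the asker's data.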

To read "all the data" (assuming that this means everything between the <body> tags), you may use another regex, e.g.:

// https://regex101.com/r/nVuN8S/2
$regex = '/<body>(?<data>(?:.|\s)+)<\/body>/';
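One way the pattern might be applied is with preg_match, reading the result from the named group data. This is a minimal sketch: the input string stands in for the page after the first closing pair has been removed, and it assumes the <body> tag carries no attributes.

```php
<?php
// Hypothetical input: the page with only one closing pair left.
$new_html = "<html>\n<body>\ntext text text\n\n\n"
          . "text2 text2 text2 text\n</body>\n</html>\n";

$regex = '/<body>(?<data>(?:.|\s)+)<\/body>/';

// preg_match returns 1 on a match, 0 on no match, false on error.
if (preg_match($regex, $new_html, $matches) === 1) {
    // The named group holds everything between <body> and the last </body>.
    $data = trim($matches['data']);
} else {
    $data = '';
}

echo $data;
```

Because the quantifier is greedy, the capture runs up to the last </body> in the string, which is why the extra closing pair should be removed first.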

Of course, there are a few other ways to get the data: simple string manipulation (remove the text before <body> and after </body>, along with the tags themselves), the DOM document functionality, etc.
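The two alternatives mentioned above could look roughly like this. The function names bodyByStringOps and bodyByDom are made up for this sketch, the input is assumed to have had its extra closing pair removed already, and the DOM variant needs PHP's DOM extension.

```php
<?php
// Hypothetical cleaned page (only one closing pair left).
$html = "<html>\n<body>\ntext text text\n\n"
      . "text2 text2 text2 text\n</body>\n</html>\n";

// Approach 1: plain string functions. Assumes <body> has no attributes.
function bodyByStringOps(string $html): string
{
    $open = stripos($html, '<body>');
    $close = strripos($html, '</body>');
    if ($open === false || $close === false || $close < $open) {
        return '';
    }
    $start = $open + strlen('<body>');
    return trim(substr($html, $start, $close - $start));
}

// Approach 2: DOMDocument, returning the text content of <body>.
function bodyByDom(string $html): string
{
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? '' : trim($body->textContent);
}

echo bodyByStringOps($html), "\n";
```

The string version returns the raw markup inside the body, while the DOM version returns only its text, so the better choice depends on whether nested tags need to be kept.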

Answer 1: 2016-12-22 10:22:20
