
PHP crawler reading all data from 2 htmls

How can I use a crawler to read all the data from a page that has 2 html tags, for example:

<html>
<body>
text text text
</body>
</html>



text2 text2 text2 text
</body>
</html>

I need to replace the first closing html and body tags, and then read all the data. How do I do that?

You can use a regular expression to replace the first occurrence of </body></html>, provided another pair of the same tags follows it:

// https://regex101.com/r/nVuN8S/1
$regex = '/(?<replace><\/body>\s*<\/html>)(?=(?:.|\s)*<\/body>\s*<\/html>)/';
$new_html = preg_replace($regex, '', $html);

Here you look for </body> and </html> separated by any amount of whitespace (e.g. a newline). A positive lookahead then checks that they are followed by any number of characters, including whitespace, and then by another pair of </body> and </html> tags.
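
As a rough worked example of the replacement described above, the sketch below runs a similar pattern against a string modelled on the markup in the question. The sample `$html` value is an assumption made for illustration. The pattern is also a variant rather than the exact one shown earlier: it uses the `s` modifier so that `.` matches newlines, and passes a limit of 1 to preg_replace so that only the first matching pair is removed.

```php
<?php
// Sample input modelled on the question: two closing </body></html> pairs.
// (Assumed for illustration; a real crawler would fetch this from the page.)
$html = "<html>\n<body>\ntext text text\n</body>\n</html>\n\n"
      . "text2 text2 text2 text\n</body>\n</html>";

// </body> and </html> separated by optional whitespace, matched only when
// another such pair appears later in the string (positive lookahead).
// The `s` modifier lets `.` match newlines as well.
$regex = '/<\/body>\s*<\/html>(?=.*<\/body>\s*<\/html>)/s';

// The fourth argument (limit = 1) removes only the first matching pair.
$new_html = preg_replace($regex, '', $html, 1);

echo $new_html;
```

The final </body></html> pair is left in place because nothing after it satisfies the lookahead.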

To read "all the data" (assuming that means everything between the <body> tags), you can use another regex, e.g.:

// https://regex101.com/r/nVuN8S/2
$regex = '/<body>(?<data>(?:.|\s)+)<\/body>/';
preg_match($regex, $new_html, $matches); // the body contents end up in $matches['data']

Of course, there are a couple of other ways to get the data: simple string manipulation (remove the text before <body> and after </body>, and the tags themselves), DOM document functionality, etc.
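
The two alternatives mentioned above might look roughly like the sketch below. Both helper names (`extract_body_by_string`, `extract_body_by_dom`) are hypothetical and introduced only for illustration. The sketch assumes the markup has already had its first </body></html> pair removed, so that a single <body> element remains.

```php
<?php
// Hypothetical helper: plain string manipulation.
// Returns the text between the first <body> and the last </body>, or null.
function extract_body_by_string(string $html): ?string
{
    $start = stripos($html, '<body>');
    $end   = strripos($html, '</body>');
    if ($start === false || $end === false || $end < $start) {
        return null;
    }
    $start += strlen('<body>');
    return trim(substr($html, $start, $end - $start));
}

// Hypothetical helper: the same extraction through DOMDocument.
// Requires the DOM extension; parser warnings are collected, not printed.
function extract_body_by_dom(string $html): ?string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $doc->getElementsByTagName('body')->item(0);
    return $body === null ? null : trim($body->textContent);
}

// Assumed sample: the question's markup after the first closing pair is removed.
$clean = "<html>\n<body>\ntext text text\n\n\ntext2 text2 text2 text\n</body>\n</html>";

echo extract_body_by_string($clean);
```

The string version only matches a bare <body> tag without attributes. Run on the original, uncleaned markup, it would keep the stray </body></html> in the middle of the result, which is why the cleanup step comes first.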

