PHP crawler reading all data from 2 htmls
How can I read all data with a crawler from a page that has 2 closing html tags, for example:
<html>
<body>
text text text
</body>
</html>
text2 text2 text2 text
</body>
</html>
I need to remove the first closing body and html tags, and then read all the data. How do I do that?
You can use a regular expression to remove the first occurrence of </body></html>, provided another pair of the same tags follows it:
// https://regex101.com/r/nVuN8S/1
$regex = '/(?<replace><\/body>\s*<\/html>)(?=(?:.|\s)*<\/body>\s*<\/html>)/';
$new_html = preg_replace($regex, '', $html);
Here you look for </body> and </html> separated by any amount of whitespace (e.g. a newline). A positive lookahead then checks that they are followed by any number of characters, including whitespace, and then by another </body> and </html> pair.
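As a rough end-to-end sketch of the removal step, the script below runs a lookahead pattern against sample input modeled on the question. The sample string is an assumption, and the pattern is a simplified variant of the one above: it drops the unused named group and uses the `s` modifier so that `.` matches newlines, instead of the `(?:.|\s)` alternation.

```php
<?php
// Sample input modeled on the question: a premature </body></html> pair in the middle.
$html = "<html>\n<body>\ntext text text\n</body>\n</html>\ntext2 text2 text2 text\n</body>\n</html>";

// Match a closing pair only when another closing pair appears later in the string.
// The "s" modifier makes "." match newlines as well.
$regex = '/<\/body>\s*<\/html>(?=.*<\/body>\s*<\/html>)/s';

// Replace every such pair with an empty string; the final pair never matches,
// because nothing follows it that satisfies the lookahead.
$new_html = preg_replace($regex, '', $html);

echo $new_html;
```

Note that preg_replace replaces every match by default, so with three or more closing pairs this removes all but the last one. Passing 1 as the fourth argument (the limit) would remove only the first.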
To read "all the data" (assuming that means everything between the <body> tags), you can use another regex, e.g.:
// https://regex101.com/r/nVuN8S/2
$regex = '/<body>(?<data>(?:.|\s)+)<\/body>/';
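A pattern like this only defines what to capture; to get the text out you would pass it to preg_match and read the named group. The following is a minimal sketch under a few assumptions: the input string is a made-up, already cleaned page, the `<body>` tag has no attributes, and the pattern uses the `s` modifier in place of `(?:.|\s)`.

```php
<?php
// Hypothetical input: a cleaned-up page with a single closing pair.
$html = "<html>\n<body>\ntext text text\ntext2 text2 text2 text\n</body>\n</html>";

// Capture everything between <body> and the last </body> into the named group "data".
// The quantifier is greedy, so it runs up to the final </body> in the string.
$regex = '/<body>(?<data>.+)<\/body>/s';

if (preg_match($regex, $html, $matches) === 1) {
    // Strip the newlines that surround the body content.
    $data = trim($matches['data']);
} else {
    // No <body>...</body> section was found.
    $data = null;
}

echo $data;
```

A tag written as `<body class="...">` would not match this pattern; something like `<body[^>]*>` would be needed for that case.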
Of course, there are other ways to get the data: simple string manipulation (remove the text before <body> and after </body>, along with the tags themselves), DOM document functionality, etc.
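The string-manipulation route could look roughly like the sketch below. The function name `extractBodyText` is hypothetical, and the approach assumes a bare `<body>` tag without attributes and that any stray closing tags inside the body should simply be dropped.

```php
<?php
/**
 * Hypothetical helper: return the text between the first <body> and the
 * last </body>, with any premature closing tags in between removed.
 */
function extractBodyText(string $html): ?string
{
    $open = stripos($html, '<body>');    // first opening tag, case-insensitive
    $close = strripos($html, '</body>'); // last closing tag, case-insensitive

    if ($open === false || $close === false || $close < $open) {
        return null; // no usable body section
    }

    $start = $open + strlen('<body>');
    $inner = substr($html, $start, $close - $start);

    // Drop closing tags that appear too early, as in the question's markup.
    $inner = str_ireplace(['</body>', '</html>'], '', $inner);

    return trim($inner);
}

// Sample input modeled on the question.
$html = "<html>\n<body>\ntext text text\n</body>\n</html>\ntext2 text2 text2 text\n</body>\n</html>";
$text = extractBodyText($html);

echo $text;
```

This version works on the original markup directly, so the separate step of removing the first closing pair is not needed. The blank lines left where the tags were removed are kept as they are.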