Site map generator, built from scratch
I'd like to know how to build a site crawler in PHP that detects each page of a website and generates an entry for it in an XML file. I've seen plenty of websites doing this, so I'm curious how to do it from scratch, or whether there is a script or tutorial that teaches it.
Don't use regex. The proper way to parse HTML is with a DOMDocument object.
http://www.php.net/manual/en/class.domdocument.php
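As a rough illustration of the DOMDocument approach, here is a minimal sketch that pulls the `href` of every `<a>` element out of an HTML string. The function name `extractLinks` is made up for this example, and the code assumes the page has already been downloaded into a string; it does not resolve relative links.

```php
<?php
// Hypothetical helper: return the unique, non-empty href values of all <a> tags.
function extractLinks(string $html): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();
    // Real-world HTML is often malformed; buffer parser warnings instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if ($href !== '') {
            $links[] = $href;
        }
    }
    // Drop duplicates and reindex from 0.
    return array_values(array_unique($links));
}
```

Calling `extractLinks()` on the HTML of a page gives the raw list of link targets, which a crawler would then resolve against the page's own URL and filter by host.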
Here is the algorithm:
Step 1-> Get the site's address and verify that it is in the correct format and ends with a page (www.xyz.com/page.html) rather than a bare directory (www.xyz.com/).
Step 2-> Get the contents of the file and use a regular expression to extract the list of pages it links to.
Step 3-> Store them in the DB for future use, and repeat step 2 on those files too.
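The steps above can be sketched roughly as follows. This is only one possible reading, under several assumptions: links are extracted with DOMDocument rather than a regular expression, an in-memory array stands in for the DB, and only pages on the starting host are followed. The names `crawlSite`, `resolveLink`, `buildSitemap` and the `$fetch` callable are hypothetical, and `resolveLink` is deliberately simplified (it does not handle `../` segments or query-only links).

```php
<?php
// Simplified URL resolution: absolute http(s), protocol-relative, root-relative, same-directory.
function resolveLink(string $base, string $href): ?string
{
    $href = trim(explode('#', $href, 2)[0]); // drop any #fragment
    if ($href === '') {
        return null;
    }
    $scheme = parse_url($href, PHP_URL_SCHEME);
    if ($scheme === false) {
        return null; // unparseable
    }
    if ($scheme !== null) {
        // Keep http/https links as they are; skip mailto:, javascript:, etc.
        return in_array(strtolower($scheme), ['http', 'https'], true) ? $href : null;
    }
    $parts = parse_url($base);
    if (!is_array($parts) || empty($parts['scheme']) || empty($parts['host'])) {
        return null;
    }
    if (substr($href, 0, 2) === '//') {
        return $parts['scheme'] . ':' . $href;
    }
    $origin = $parts['scheme'] . '://' . $parts['host']
        . (isset($parts['port']) ? ':' . $parts['port'] : '');
    if ($href[0] === '/') {
        return $origin . $href;
    }
    $path = $parts['path'] ?? '/';
    return $origin . substr($path, 0, strrpos($path, '/') + 1) . $href;
}

// Breadth-first crawl. $fetch is callable(string $url): ?string (HTML, or null on failure).
function crawlSite(string $startUrl, callable $fetch, int $maxPages = 100): array
{
    $host  = parse_url($startUrl, PHP_URL_HOST);
    $queue = [$startUrl];
    $seen  = [$startUrl => true]; // stands in for the DB of harvested pages
    $pages = [];

    while ($queue && count($pages) < $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        if ($html === null || trim($html) === '') {
            continue;
        }
        $pages[] = $url;

        $doc = new DOMDocument();
        $previous = libxml_use_internal_errors(true);
        $doc->loadHTML($html);
        libxml_clear_errors();
        libxml_use_internal_errors($previous);

        foreach ($doc->getElementsByTagName('a') as $anchor) {
            $link = resolveLink($url, $anchor->getAttribute('href'));
            if ($link !== null && !isset($seen[$link])
                && parse_url($link, PHP_URL_HOST) === $host) {
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $pages;
}

// Build a sitemap document with one <url><loc>...</loc></url> entry per page.
function buildSitemap(array $urls): string
{
    $ns  = 'http://www.sitemaps.org/schemas/sitemap/0.9';
    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $urlset = $doc->appendChild($doc->createElementNS($ns, 'urlset'));
    foreach ($urls as $url) {
        $entry = $urlset->appendChild($doc->createElementNS($ns, 'url'));
        $loc   = $entry->appendChild($doc->createElementNS($ns, 'loc'));
        $loc->appendChild($doc->createTextNode($url)); // text node escapes & and <
    }
    return $doc->saveXML();
}
```

For a real site, `$fetch` could be a small closure around `file_get_contents()` or cURL that returns `null` on failure, and the result of `buildSitemap(crawlSite($startUrl, $fetch))` could be written to `sitemap.xml` with `file_put_contents()`. The `$maxPages` cap is there so the sketch cannot run forever on a large site.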