
Site map generator, built from scratch

I'd like to know how to build a site crawler, in PHP, that detects each page of a website and generates an entry for it in an XML file. I've seen plenty of websites doing this, so I'm curious how to do it from scratch, or whether there is any script or tutorial that teaches it.

Don't use regex. The proper way to parse HTML is with a DOMDocument object.

  1. Load the first page into a DOMDocument object.
  2. Use XPath statements to gather all of the anchor tag hrefs found in that page.
  3. Use those values to find more pages to load, then start over from step one.

http://www.php.net/manual/en/class.domdocument.php
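The three steps above can be sketched as follows. This is a minimal illustration, not a full crawler: the HTML is an inline string here, whereas a real crawler would fetch each page with `file_get_contents()` or cURL, and the sample links are made-up examples.

```php
<?php
// Step 1: load a page into a DOMDocument object.
// (Inline HTML for illustration; a real crawler would fetch the page
// over HTTP with file_get_contents() or cURL.)
$html = '<html><body>
    <a href="/about.html">About</a>
    <a href="/contact.html">Contact</a>
</body></html>';

$doc = new DOMDocument();
// Suppress warnings that real-world, malformed HTML often triggers.
@$doc->loadHTML($html);

// Step 2: use an XPath query to gather every anchor's href attribute.
$xpath = new DOMXPath($doc);
$links = [];
foreach ($xpath->query('//a[@href]') as $anchor) {
    $links[] = $anchor->getAttribute('href');
}

// Step 3: each collected link is a candidate page to load, repeating
// from step 1 (deduplicate so the crawler does not loop forever).
$links = array_unique($links);
print_r($links);
```

In a real crawler you would also resolve relative hrefs against the page's base URL and skip links that point off-site before queueing them for the next round.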

Here is the algorithm:
Step 1 -> Get a site's address; verify that the address is in the correct format and that it ends with a page (www.xyz.com/page.html), not just a bare domain (www.xyz.com/).
Step 2 -> Get the contents of the file and, using a regular expression, try to extract the list of pages it links to.
Step 3 -> Harvest them in the DB for future use, and apply step 2 to those files too.
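Once the crawler has harvested its list of pages (from the DB in step 3), the last piece the question asks about is emitting the XML entries. A minimal sketch using DOMDocument to build a sitemap in the sitemaps.org format; the URL list is hard-coded here as a stand-in for whatever the crawler stored:

```php
<?php
// Stand-in for the harvested URLs; a real generator would read these
// from the DB populated in step 3.
$urls = [
    'http://www.xyz.com/page.html',
    'http://www.xyz.com/about.html',
];

// Build the sitemap document per the sitemaps.org protocol.
$doc = new DOMDocument('1.0', 'UTF-8');
$doc->formatOutput = true;

$urlset = $doc->createElement('urlset');
$urlset->setAttribute('xmlns', 'http://www.sitemaps.org/schemas/sitemap/0.9');
$doc->appendChild($urlset);

foreach ($urls as $url) {
    // One <url><loc>...</loc></url> entry per crawled page.
    $entry = $doc->createElement('url');
    $entry->appendChild($doc->createElement('loc', htmlspecialchars($url)));
    $urlset->appendChild($entry);
}

// Serialize the sitemap (write this string to sitemap.xml).
echo $doc->saveXML();
```

The `htmlspecialchars()` call is there because `createElement()` does not escape special characters such as `&` in its value argument.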

