
How to web scrape an XML sitemap

I am trying to figure out whether it is possible to scrape a website's sitemap and, every time the sitemap changes, log what changed and send it to an email address or Telegram account.

Does anyone know if this is possible and, if so, where to get started?

Thanks

I am assuming you already scrape the sitemap.

Yes, it is possible. You need to schedule a task that runs automatically at regular intervals.
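The scheduling part could be sketched roughly as below. This is only a minimal in-process loop; `run_periodically`, `max_runs` and the injectable `sleep` parameter are names made up for this illustration, and in practice a cron job or the operating system's task scheduler would likely do the same job more robustly.

```python
import time


def run_periodically(task, interval_seconds, max_runs=None, sleep=time.sleep):
    """Call task() repeatedly, waiting interval_seconds between calls.

    max_runs is a hypothetical convenience for testing; None means run forever.
    Returns the number of times the task was run.
    """
    runs = 0
    while max_runs is None or runs < max_runs:
        task()
        runs += 1
        # Only wait if another run is going to follow.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return runs


if __name__ == "__main__":
    # Example: print a message three times, one second apart.
    run_periodically(lambda: print("checking sitemap..."), 1, max_runs=3)
```

With `max_runs=None` the loop keeps going until the process is stopped, so it would need to be started as a long-running service.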

In this task, you read the website's sitemap and save all the URLs in a database. For each URL, check whether it is already in the database. If the URL is new and not yet in the database, send it to email/Telegram and add it to the database.
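The check described above might look something like this. It is a sketch under a few assumptions: the sitemap uses the standard sitemaps.org namespace, the database is a local SQLite file, and the function names and the `seen_urls` table are invented for the example. It does not handle gzipped sitemaps or sitemap index files.

```python
import sqlite3
import urllib.request
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def fetch_sitemap(url):
    """Download the sitemap and return its raw bytes (no gzip handling)."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def extract_urls(sitemap_xml):
    """Return the text of every <loc> element in the sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iterfind(".//sm:loc", SITEMAP_NS) if loc.text]


def record_new_urls(conn, urls):
    """Store URLs not seen before and return just those new URLs."""
    conn.execute("CREATE TABLE IF NOT EXISTS seen_urls (url TEXT PRIMARY KEY)")
    new_urls = []
    for url in urls:
        # INSERT OR IGNORE skips rows whose primary key already exists,
        # so rowcount is 1 only when the URL was actually added.
        cur = conn.execute("INSERT OR IGNORE INTO seen_urls (url) VALUES (?)", (url,))
        if cur.rowcount == 1:
            new_urls.append(url)
    conn.commit()
    return new_urls
```

On the first run every URL counts as new, so you may want to fill the database once without sending anything.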

Each time the scheduled task runs, it finds all the new URLs, emails them to you, and updates the database. Hope this is helpful.
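For the email step, something along these lines could work with Python's standard library. The function names, addresses and SMTP host are placeholders for illustration, and real mail servers usually also need TLS and a login, which are left out here.

```python
import smtplib
from email.message import EmailMessage


def build_notification(new_urls, sender, recipient):
    """Build an email listing the newly discovered sitemap URLs."""
    msg = EmailMessage()
    msg["Subject"] = f"{len(new_urls)} new sitemap URL(s)"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content("\n".join(new_urls))
    return msg


def send_notification(msg, host, port=25):
    """Send the message through an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)
```

You would only call `send_notification` when the list of new URLs is not empty, so that you are not emailed on every run.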

If you haven't scraped the sitemap yet, you can do this with jsoup or Scrapy.


 