
IBM Watson Discovery crawling issue

We want to index our client's website and store all of its data in the IBM Watson Discovery service. We will connect Discovery to Watson Assistant, so that when a user asks a question related to the client's data, the chatbot queries Discovery and fetches the data it needs to respond.

Problem: The client's website has multiple links, and each linked page contains further links. We want to crawl all the data from the website, then index and store it in the Watson Discovery service. We tried crawling the site, but the Discovery service is taking a very long time and had still not completed the task after one week. Please let us know how we can achieve this in a better and faster way.

Note that web crawling is currently in beta, and the Watson Discovery documentation for web crawl states that, depending on the website, it will not ingest all data.

I used the web crawl in Discovery in a scenario similar to yours, and I query my website through a chat built with Watson Assistant. What you should do:

  • increase the number of hops, which controls how deep Watson Discovery crawls your website
  • depending on your website, add multiple entry points
  • specify all the paths that you want to exclude. I excluded those that would add duplicate entries, as well as generated summary pages, RSS feeds, etc.
  • adjust how often it should crawl
  • check that Watson Discovery can access your website and that your website does not block crawling
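
To get a feel for how these settings interact before changing the crawl configuration, here is a minimal Python sketch. It is not Watson Discovery's API or its actual crawling logic, only a model of a breadth-first crawl under an assumed hop limit, a set of entry points, and excluded path prefixes. The `SITE` link graph, the `ROBOTS_TXT` content, the `example-crawler` user agent, and both function names are hypothetical. The robots.txt check uses Python's standard `urllib.robotparser`.

```python
from collections import deque
from urllib.robotparser import RobotFileParser


def pages_in_scope(links, entry_points, max_hops, exclude_prefixes=()):
    """Model a breadth-first crawl; return {path: hops needed to reach it}."""
    def excluded(path):
        return any(path.startswith(prefix) for prefix in exclude_prefixes)

    reached = {}
    # Every entry point starts at hop 0.
    queue = deque((path, 0) for path in entry_points if not excluded(path))
    while queue:
        path, hops = queue.popleft()
        if path in reached:
            continue  # already reached with the same or fewer hops
        reached[path] = hops
        if hops == max_hops:
            continue  # hop limit reached: do not follow links from this page
        for target in links.get(path, []):
            if target not in reached and not excluded(target):
                queue.append((target, hops + 1))
    return reached


def robots_allows(robots_txt, user_agent, path):
    """Check whether the given robots.txt text lets user_agent fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


# Hypothetical site structure: page path -> paths it links to.
SITE = {
    "/": ["/products", "/blog", "/feed.rss"],
    "/products": ["/products/a", "/products/b"],
    "/blog": ["/blog/post-1", "/blog/summary"],
    "/products/a": ["/products/a/specs"],
}

# Hypothetical robots.txt content for the same site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    # One entry point and one hop: only the pages linked from the home page.
    print(sorted(pages_in_scope(SITE, ["/"], max_hops=1)))
    # More hops, with feeds and summary pages excluded.
    print(sorted(pages_in_scope(SITE, ["/"], max_hops=3,
                                exclude_prefixes=["/feed", "/blog/summary"])))
    # Two entry points reach deeper pages with fewer hops.
    print(sorted(pages_in_scope(SITE, ["/products", "/blog"], max_hops=1)))
    print(robots_allows(ROBOTS_TXT, "example-crawler", "/products"))
```

In this model, raising the hop limit or adding entry points widens what is reached, while exclusions keep duplicate or generated pages out of the result. For a real site, the link graph would have to come from a sitemap or your own inventory of pages, and the robots.txt text from the site itself.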
