

Getting the full HTML source code of a page for making a web crawler

I'm trying to write a web crawler in Java that takes the URL of a web page and navigates to the other pages referenced in that page's source code. The problem is that the HTML source I get with jsoup contains various tags, such as frames, and the names of some JavaScript files. To navigate to the other pages, I need to access the HTTP links given in the frames and JavaScript files. How can I collect those links in a list?

You need to do it recursively: whenever you find a frame tag/element in a DOM object, fetch the DOM of the document its "src" attribute points to, and keep doing this for every frame you encounter. Store all the links you find in each fetched document in an array.
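A minimal sketch of that recursion, under some assumptions: the class name `FrameLinkCollector`, the `fetcher` function (URL in, HTML out) and the `maxDepth` limit are all hypothetical names introduced for illustration, and the regular expressions are only a dependency-free stand-in for real HTML parsing. With jsoup you would more likely fetch with `Jsoup.connect(url).get()` and pick out frames with something like `doc.select("frame[src], iframe[src]")` and `absUrl("src")`.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FrameLinkCollector {
    // Simplified patterns (assumption): quoted src/href attributes only.
    private static final Pattern FRAME_SRC = Pattern.compile(
            "<i?frame\\b[^>]*?\\bsrc\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern A_HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    private final Function<String, String> fetcher; // URL -> HTML, or null if unavailable
    private final int maxDepth;                     // guards against very deep nesting
    private final Set<String> visited = new HashSet<>();
    private final Set<String> links = new LinkedHashSet<>();

    public FrameLinkCollector(Function<String, String> fetcher, int maxDepth) {
        this.fetcher = fetcher;
        this.maxDepth = maxDepth;
    }

    /** Returns every anchor link found in the page and in all frames nested inside it. */
    public Set<String> collect(String url) {
        crawl(url, 0);
        return links;
    }

    private void crawl(String url, int depth) {
        // Stop on depth overflow or when a frame points back to a page already seen.
        if (depth > maxDepth || !visited.add(url)) {
            return;
        }
        String html = fetcher.apply(url);
        if (html == null) {
            return;
        }
        links.addAll(extract(A_HREF, html, url));
        // Recurse into the document behind each frame's src attribute.
        for (String frameUrl : extract(FRAME_SRC, html, url)) {
            crawl(frameUrl, depth + 1);
        }
    }

    private static List<String> extract(Pattern pattern, String html, String baseUrl) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            try {
                // Resolve relative references against the page they were found on.
                found.add(URI.create(baseUrl).resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Malformed URL: skip it rather than abort the crawl.
            }
        }
        return found;
    }
}
```

The `visited` set matters because frames can reference each other, which would otherwise make the recursion loop forever.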
You can use new threads to fetch the frames' DOMs, just to make the whole process a little faster.
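One possible way to do the threaded part is a small thread pool that fetches all frame URLs of a page concurrently. This is only a sketch: `ParallelFrameFetcher`, `fetchAll`, and the `fetcher` function are hypothetical names, and the fetcher is assumed to be safe to call from several threads at once.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelFrameFetcher {
    /** Fetches every URL on a fixed-size pool; returns URL -> HTML for the fetches that succeeded. */
    public static Map<String, String> fetchAll(List<String> urls,
                                               Function<String, String> fetcher,
                                               int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit all fetches first so they run concurrently.
            Map<String, Future<String>> pending = new LinkedHashMap<>();
            for (String url : urls) {
                Callable<String> task = () -> fetcher.apply(url);
                pending.put(url, pool.submit(task));
            }
            // Then collect the results in submission order.
            Map<String, String> result = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> entry : pending.entrySet()) {
                try {
                    String html = entry.getValue().get();
                    if (html != null) {
                        result.put(entry.getKey(), html);
                    }
                } catch (ExecutionException e) {
                    // This fetch failed; skip it and keep the others.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
                    break;
                }
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

A bounded pool is used here instead of one new thread per frame so that a page with many frames does not open an unbounded number of connections.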


 