
Getting the full HTML source code of a page for making a web crawler

Original post 2015-10-14 11:06:01 5 1 java/ web-crawler/ jsoup

I'm trying to make a web crawler in Java that takes the URL of a web page and navigates to the other pages referenced in that page's source code. The issue is that the HTML source I get with the help of jsoup contains various tags, such as frames, and some JavaScript file names. To navigate to the other pages, I need to access the HTTP links given in the frames and JavaScript files. How should I get those links into a list?

1 answer

You need to do this recursively: whenever you find a frame tag (Element) in a DOM object, fetch the document that its "src" attribute points to, and repeat the same step for any frames inside that document. Store all the links you find during these fetches in an array.
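The recursive idea can be sketched roughly as follows. This is only a minimal illustration, not a complete crawler: the class name `FrameLinkCollector`, the `PageLoader` interface, and the `maxDepth` limit are assumptions made for the example, while `Jsoup.connect(...).get()`, `select(...)`, and `absUrl(...)` are regular jsoup calls. Only `http`/`https` URLs are kept, and a visited set guards against frames that point back at each other.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class FrameLinkCollector {

    /** Hypothetical hook for loading a page; lets the crawl logic run without a network. */
    public interface PageLoader {
        Document load(String url) throws IOException;
    }

    private final PageLoader loader;
    private final int maxDepth;                                 // assumed safety limit on frame nesting
    private final Set<String> visited = new LinkedHashSet<>();  // pages already fetched
    private final Set<String> links = new LinkedHashSet<>();    // links found, in discovery order

    public FrameLinkCollector(PageLoader loader, int maxDepth) {
        this.loader = loader;
        this.maxDepth = maxDepth;
    }

    /** Returns every http(s) link found on the start page and inside its (nested) frames. */
    public List<String> collect(String startUrl) {
        crawl(startUrl, 0);
        return new ArrayList<>(links);
    }

    private void crawl(String url, int depth) {
        if (depth > maxDepth || !visited.add(url)) {
            return; // too deep, or this page was already processed
        }
        Document doc;
        try {
            doc = loader.load(url);
        } catch (IOException | IllegalArgumentException e) {
            return; // unreachable or malformed URL: skip it and carry on
        }
        // Ordinary hyperlinks; absUrl resolves relative hrefs against the page's base URI.
        for (Element anchor : doc.select("a[href]")) {
            String href = anchor.absUrl("href");
            if (href.startsWith("http")) {
                links.add(href);
            }
        }
        // Frames: fetch the document behind each "src" and repeat the same steps on it.
        for (Element frame : doc.select("frame[src], iframe[src]")) {
            String src = frame.absUrl("src");
            if (src.startsWith("http")) {
                crawl(src, depth + 1);
            }
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java FrameLinkCollector <url>");
            return;
        }
        PageLoader overNetwork = url -> Jsoup.connect(url).get();
        for (String link : new FrameLinkCollector(overNetwork, 3).collect(args[0])) {
            System.out.println(link);
        }
    }
}
```

External script files could be picked up the same way by adding `script[src]` to a selector, but links that only appear inside JavaScript code are not visible to an HTML parser and would need separate handling.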
You can use new threads to fetch the frames' DOMs, just to make the whole process a little faster.
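One possible way to add threads is a fixed-size pool that fetches all frames at the same nesting level concurrently. The sketch below shows only the threading part and uses nothing beyond the JDK. The names `ParallelFrameFetcher`, `PageResult`, and the `fetcher` function are assumptions for illustration: in a real crawler the fetcher would load and parse a page (for example with jsoup) and return its links and frame URLs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class ParallelFrameFetcher {

    /** Hypothetical result of fetching one page: its links and the frame URLs it embeds. */
    public static final class PageResult {
        public final List<String> links;
        public final List<String> frameUrls;

        public PageResult(List<String> links, List<String> frameUrls) {
            this.links = links;
            this.frameUrls = frameUrls;
        }
    }

    /** Collects links from the start page and all nested frames, one nesting level at a time. */
    public static Set<String> collect(String startUrl, Function<String, PageResult> fetcher, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        Set<String> visited = new LinkedHashSet<>();
        Set<String> links = new LinkedHashSet<>();
        visited.add(startUrl);
        List<String> frontier = Collections.singletonList(startUrl);
        try {
            while (!frontier.isEmpty()) {
                // Start one fetch per page at this level; the pool runs them concurrently.
                List<Future<PageResult>> pending = new ArrayList<>();
                for (String url : frontier) {
                    Callable<PageResult> task = () -> fetcher.apply(url);
                    pending.add(pool.submit(task));
                }
                // Gather results on the calling thread, so the sets need no locking.
                List<String> next = new ArrayList<>();
                for (Future<PageResult> future : pending) {
                    try {
                        PageResult page = future.get();
                        if (page == null) {
                            continue; // nothing was returned for this URL
                        }
                        links.addAll(page.links);
                        for (String frameUrl : page.frameUrls) {
                            if (visited.add(frameUrl)) {
                                next.add(frameUrl); // fetch each frame only once
                            }
                        }
                    } catch (ExecutionException e) {
                        // The fetch threw an exception; skip this frame and keep going.
                    }
                }
                frontier = next;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop early but preserve the interrupt
        } finally {
            pool.shutdown();
        }
        return links;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a site, so the demo runs without network access.
        Map<String, PageResult> site = new HashMap<>();
        site.put("root", new PageResult(Arrays.asList("http://a.test/"), Arrays.asList("f1", "f2")));
        site.put("f1", new PageResult(Arrays.asList("http://b.test/"), Arrays.asList("root")));
        site.put("f2", new PageResult(Arrays.asList("http://c.test/"), Collections.<String>emptyList()));
        System.out.println(collect("root", site::get, 4));
    }
}
```

Working level by level keeps the shared sets on a single thread and avoids worker threads blocking on tasks they submitted themselves. The speed-up comes from overlapping network waits, so it only matters when a page embeds several frames.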

Related questions

Is it possible to store in solr full html page source code?

Parsing HTML in web crawler

Retrieving full page source from web url

Problem with getting source code of page

Web Crawler vs Html Parser

Java - Extracting plaintext from web-page source code (getting massive quantities of lyrics from website)

Html code for a “full-page” applet (no javascript)?

Fetch source code of web page using java?

Parse web page content, not source code

How to get a web page source code with GeckoView


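Notice: The technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please cite this site's URL or the original address. For any questions, contact: yoyou2525@163.com.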


