
How to extract links from web content?

I have downloaded a web page and I want to extract all the links in that file. These links include both absolute and relative URLs. For example, we have:

<script type="text/javascript" src="/assets/jquery-1.8.0.min.js"></script>

or

<a href="http://stackoverflow.com/" />

So after reading the file, what should I do?

This isn't that complicated to do if you want to use the built-in regex system from Java. The hard bit is finding the right regex to match URLs [1][2]. For the sake of the answer, I'm gonna just assume you've done that, and stored it as a Pattern with syntax along the lines of this:

Pattern url = Pattern.compile("your regex here");

and that you have some way of iterating through each line. What you'll want to do is define an ArrayList&lt;String&gt;:

ArrayList<String> urlsFound = new ArrayList<>();

From there, you'll have some loop to iterate through your file (assuming each line is a <? extends CharSequence> line), and inside it you'll put this:

Matcher urlMatch = url.matcher(line);
while (urlMatch.find()) urlsFound.add(urlMatch.group());

What this does is create a Matcher for your line using the URL-matching Pattern from before. Then, it loops until #find() returns false (i.e., there are no more matches) and adds each match (via #group()) to the list, urlsFound.

At the end of your loop, urlsFound will contain all the matches for all of the URLs on the page. Note that this can get quite memory-intensive if you've got a lot of text, as urlsFound will get quite big, and you'll be creating and discarding a lot of Matchers.
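Putting those pieces together, a minimal runnable sketch might look like the following. Note that the `(?:href|src)` pattern here is just an illustrative placeholder standing in for a real URL regex, and the `LinkExtractor` class name is made up for this example:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Placeholder pattern: captures the quoted value of href/src attributes.
    // Substitute your own URL regex here; the whole URL must be in one group.
    private static final Pattern URL =
            Pattern.compile("(?:href|src)\\s*=\\s*\"([^\"]+)\"");

    public static List<String> extractLinks(Iterable<String> lines) {
        List<String> urlsFound = new ArrayList<>();
        for (String line : lines) {
            Matcher urlMatch = URL.matcher(line);
            // Keep finding matches until the line is exhausted.
            while (urlMatch.find()) {
                urlsFound.add(urlMatch.group(1)); // group 1 = the attribute value
            }
        }
        return urlsFound;
    }

    public static void main(String[] args) {
        List<String> page = List.of(
            "<script type=\"text/javascript\" src=\"/assets/jquery-1.8.0.min.js\"></script>",
            "<a href=\"http://stackoverflow.com/\">link</a>");
        System.out.println(extractLinks(page));
        // prints [/assets/jquery-1.8.0.min.js, http://stackoverflow.com/]
    }
}
```

This picks up both the relative and the absolute link from the question's examples; whether it is "good enough" depends entirely on how messy your input HTML is.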

1: I found a few good sites with a quick Google search; the cream of the crop seem to be here and here, as far as I can tell. Your needs may vary.

2: You'll need to make sure that the entire URL is captured in a single group, or this won't work at all. It can be tweaked to work if there are multiple parts, though.
