How to extract links from web content?
I have downloaded a web page and I want to extract all the links in that file. These links include both absolute and relative URLs. For example, we have:
<script type="text/javascript" src="/assets/jquery-1.8.0.min.js"></script>
or
<a href="http://stackoverflow.com/" />
So after reading the file, what should I do?
This isn't that complicated to do if you want to use the built-in regex system from Java. The hard bit is finding the right regex to match URLs [1][2]. For the sake of the answer, I'm going to assume you've done that, and stored it as a `Pattern` with syntax along the lines of this:
Pattern url = Pattern.compile("your regex here");
and that you have some way of iterating through each line. What you'll want to do is define an `ArrayList<String>`:
ArrayList<String> urlsFound = new ArrayList<>();
From there, you'll have some loop to iterate through your file (assuming each line is a `<? extends CharSequence> line`), and inside you'll put this:
Matcher urlMatch = url.matcher(line);
while (urlMatch.find()) urlsFound.add(urlMatch.group());
What this does is create a `Matcher` for your line and the URL-matching `Pattern` from before. Then it loops until `find()` returns false (i.e., there are no more matches) and adds each match (retrieved with `group()`) to the list `urlsFound`.
At the end of your loop, `urlsFound` will contain all the matches for all of the URLs on the page. Note that this can get quite memory-intensive if you've got a lot of text, as `urlsFound` will get quite big, and you'll be creating and discarding a lot of `Matcher`s.
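Putting the steps above together, here is a minimal, self-contained sketch. The `href`/`src` attribute pattern is only a stand-in for illustration (a real URL regex, as the footnotes note, needs much more care), and the class and method names are my own, not from the original answer:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {

    // Deliberately simple illustrative pattern: grabs the quoted value of
    // any href="..." or src="..." attribute. Real URL matching needs a
    // more careful regex (see the footnotes).
    private static final Pattern URL = Pattern.compile("(?:href|src)=\"([^\"]+)\"");

    public static List<String> extractLinks(Iterable<String> lines) {
        List<String> urlsFound = new ArrayList<>();
        for (String line : lines) {
            Matcher urlMatch = URL.matcher(line);
            while (urlMatch.find()) {
                urlsFound.add(urlMatch.group(1)); // group 1 = the captured URL
            }
        }
        return urlsFound;
    }

    public static void main(String[] args) {
        List<String> html = List.of(
                "<script type=\"text/javascript\" src=\"/assets/jquery-1.8.0.min.js\"></script>",
                "<a href=\"http://stackoverflow.com/\" />");
        System.out.println(extractLinks(html));
        // prints [/assets/jquery-1.8.0.min.js, http://stackoverflow.com/]
    }
}
```

This collects both the relative link (`/assets/jquery-1.8.0.min.js`) and the absolute one (`http://stackoverflow.com/`) from the example markup in the question.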
1: I found a few good sites with a quick Google search; the cream of the crop seem to be here and here, as far as I can tell. Your needs may vary.
2: You'll need to make sure that the entire URL is captured with a single group, or this won't work at all. It can be tweaked to work if there are multiple parts, though.
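As a quick illustration of that tweak (the pattern and text here are hypothetical, not from the answer): if your regex captures the URL in several groups, you can either concatenate the pieces yourself or just take `group(0)`, which is always the entire match:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    public static void main(String[] args) {
        // Multi-part pattern: the scheme and the remainder are captured separately.
        Pattern multi = Pattern.compile("(https?)://(\\S+)");
        Matcher m = multi.matcher("see http://example.com/page for details");
        if (m.find()) {
            // Reassemble the full URL from the pieces...
            String rebuilt = m.group(1) + "://" + m.group(2);
            // ...or simply use group(0), the entire matched text.
            String whole = m.group(0);
            System.out.println(rebuilt.equals(whole)); // prints true
        }
    }
}
```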