
How to extract links from web content?

I have downloaded a web page and I want to extract all the links in that file. These links include both absolute and relative ones. For example, we have:

<script type="text/javascript" src="/assets/jquery-1.8.0.min.js"></script>

or

<a href="http://stackoverflow.com/" />

So after reading the file, what should I do?

This isn't that complicated to do, if you want to use the builtin regex system from Java. The hard bit is finding the right regex to match URLs [1][2] . For the sake of the answer, I'm gonna just assume you've done that, and stored that as a Pattern with syntax along the lines of this:

Pattern url = Pattern.compile("your regex here");

and some way of iterating through each line. What you'll want to do is define an ArrayList<String> :

ArrayList<String> urlsFound = new ArrayList<>();

From there, you'll have some loop to iterate through your file (assuming each line is a CharSequence, such as a String, held in a variable called line), and inside you'll put this:

Matcher urlMatch = url.matcher(line);
// Matcher has no match() method; group() returns the text of the current match
while (urlMatch.find()) urlsFound.add(urlMatch.group());

What this does is create a Matcher for your line and the URL-matching Pattern from before. Then it loops until #find() returns false (i.e., there are no more matches) and adds each match (retrieved with #group()) to the list, urlsFound.
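Putting those pieces together, here is a minimal sketch of the whole loop. The class name UrlScanner, the method name findUrls, and above all the regex are my own placeholders, not part of the original answer: the pattern only catches absolute http/https URLs and stops at whitespace, quotes, or angle brackets, so relative links like /assets/jquery-1.8.0.min.js are not picked up by it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlScanner {
    // Placeholder pattern (an assumption, not a complete URL grammar):
    // absolute http/https URLs up to whitespace, a quote, or an angle bracket.
    static final Pattern URL = Pattern.compile("https?://[^\\s\"'<>]+");

    // Collects every match of URL in the given lines, in order of appearance.
    static List<String> findUrls(Iterable<? extends CharSequence> lines) {
        List<String> urlsFound = new ArrayList<>();
        for (CharSequence line : lines) {
            Matcher urlMatch = URL.matcher(line);
            while (urlMatch.find()) {
                urlsFound.add(urlMatch.group()); // group() is the whole current match
            }
        }
        return urlsFound;
    }
}
```

Calling UrlScanner.findUrls(...) with the lines of your downloaded page should hand back the list that the loop above builds up; swap in your own Pattern once you have one you trust.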

At the end of your loop, urlsFound will contain all the matches for all of the URLs on the page. Note that this can get quite memory-intensive if you've got a lot of text, as urlsFound will get quite big, and you'll be creating and discarding a lot of Matchers.
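If the memory side worries you, one possible way to soften it is sketched below: read the page line by line through a BufferedReader, and reuse a single Matcher via reset(...) rather than creating one per line. The names StreamingUrlScanner and findUrls are made up for illustration, and storing results in a LinkedHashSet (which drops repeated URLs) is my own choice, not something the approach above requires.

```java
import java.io.BufferedReader;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class StreamingUrlScanner {
    // Scans the reader one line at a time, so the whole page is never
    // held in memory as a single string.
    static Set<String> findUrls(BufferedReader reader, Pattern url) {
        // LinkedHashSet: drops duplicate URLs but keeps first-seen order.
        Set<String> urlsFound = new LinkedHashSet<>();
        // One Matcher for the whole scan; reset(...) points it at each new line.
        Matcher urlMatch = url.matcher("");
        reader.lines().forEach(line -> {
            urlMatch.reset(line);
            while (urlMatch.find()) {
                urlsFound.add(urlMatch.group());
            }
        });
        return urlsFound;
    }
}
```

For a file on disk you could pass in something like Files.newBufferedReader(Path.of("page.html")), where page.html is just a stand-in for wherever you saved the download.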

1: I found a few good sites with a quick Google search ; the cream of the crop seem to be here and here , as far as I can tell. Your needs may vary.

2: You'll need to make sure that the entire URL is captured with a single group, or this won't work at all. It can be tweaked to work if there are multiple parts, though.
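As an illustration of the "multiple parts" tweak, here is a rough sketch that matches an href or src attribute and pulls out only the quoted value with group(1). This also picks up relative links like the one in the question. The pattern is an assumption of mine: it only handles double-quoted values and is not a real HTML parser, and the names AttributeLinkScanner and findLinks are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttributeLinkScanner {
    // Assumed pattern: href or src, an equals sign, then a double-quoted value.
    // Group 1 is the text between the quotes (absolute or relative).
    static final Pattern LINK_ATTR =
        Pattern.compile("(?:href|src)\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns the attribute values found in the given lines, in order.
    static List<String> findLinks(Iterable<? extends CharSequence> lines) {
        List<String> links = new ArrayList<>();
        for (CharSequence line : lines) {
            Matcher m = LINK_ATTR.matcher(line);
            while (m.find()) {
                links.add(m.group(1)); // only the captured value, not the attribute name
            }
        }
        return links;
    }
}
```

Here group() with no argument would return the whole attribute (for example href="..."), while group(1) returns just the link itself.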


 