
Getting filename and extension from a friendly URL in Java

I'm writing a small Java program for downloading blacklists from the Internet.
The URLs can be of two types:
1) direct link, e.g. http://www.shallalist.de/Downloads/shallalist.tar.gz
No problem here: we can use a library such as org.apache.commons.io.FilenameUtils, or simply look for the last occurrence of "/" and ".".
2) "friendly URL", which looks something like: http://urlblacklist.com/cgi-bin/commercialdownload.pl?type=download&file=bigblacklist
Here the URL contains no explicit filename or extension, but if I use my browser or Internet Download Manager (IDM), the file is saved as "bigblacklist.tar.gz".
How can I solve this in Java and get the filename and extension from "friendly" URLs?
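
For the direct-link case, the "last occurrence of /" approach can be sketched as below. This is only an illustration: the class and method names (UrlFileName, fileNameFromUrl, extensionOf) are made up for this example, and splitting at the first dot so that "tar.gz" stays together is an assumption that will not suit every filename (something like "my.file.txt" would yield "file.txt").

```java
import java.net.URI;

public class UrlFileName {

    // Returns the last path segment of the URL; the query string is ignored.
    static String fileNameFromUrl(String url) {
        String path = URI.create(url).getPath();
        if (path == null) {
            return "";
        }
        return path.substring(path.lastIndexOf('/') + 1);
    }

    // Assumption: everything after the FIRST dot is the extension,
    // so "shallalist.tar.gz" gives "tar.gz" rather than "gz".
    static String extensionOf(String fileName) {
        int dot = fileName.indexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1);
    }

    public static void main(String[] args) {
        String name = fileNameFromUrl("http://www.shallalist.de/Downloads/shallalist.tar.gz");
        System.out.println(name);
        System.out.println(extensionOf(name));
    }
}
```

Applied to the friendly URL, fileNameFromUrl returns "commercialdownload.pl", which is exactly why the URL alone is not enough in the second case.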

PS: I know about the Content-Disposition and Content-Type fields, but the response headers I get for the urlblacklist link are:

Transfer-Encoding : [chunked]
Keep-Alive : [timeout=5, max=100]
null : [HTTP/1.1 200 OK]
Server : [Apache/2.4.10 (Debian)]
Connection : [Keep-Alive]
Date : [Sat, 05 Sep 2015 23:51:35 GMT]
Content-Type : [ application/octet-stream]

As we can see, nothing here points to .gzip (.gz). How can I deal with this in Java?
And how do web browsers and download managers determine the correct name and extension?

===============UPDATE=====================
Thanks to @eugenioy, the problem is solved. The real trouble was that my IP had been blocked after multiple download attempts, which is why I decided to use proxies. The code now looks like this (for both types of URL):

Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyIP, port));
HttpURLConnection httpConn = (HttpURLConnection) new URL(downloadFrom).openConnection(proxy);
String disposition = httpConn.getHeaderField("Content-Disposition");
if (disposition != null && disposition.contains("filename")) {
    // extract the file name from the header field, dropping optional quotes
    fullFileName = disposition.substring(disposition.lastIndexOf("=") + 1).replace("\"", "").trim();
} else {
    // no usable header: fall back to the last path segment of the URL
    fullFileName = downloadFrom.substring(downloadFrom.lastIndexOf("/") + 1);
}

Now fullFileName contains the name of the file to download + its extension.
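
Taking everything after the last "=" works for this server, but it can misfire when the header carries quoted values or extra parameters. A slightly more defensive parse could look like the sketch below. DispositionParser and fileNameFrom are hypothetical names for this example, and the regular expression is a simplification: it handles the plain filename=... and filename="..." forms only and deliberately skips the encoded filename*= form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DispositionParser {

    // Matches filename=value or filename="value", case-insensitively.
    // "filename*=" is not matched because of the '*' before the '='.
    private static final Pattern FILENAME = Pattern.compile(
            "filename\\s*=\\s*(\"([^\"]*)\"|[^;\\s]+)", Pattern.CASE_INSENSITIVE);

    // Returns the file name from a Content-Disposition value, or null if none is found.
    static String fileNameFrom(String disposition) {
        if (disposition == null) {
            return null;
        }
        Matcher m = FILENAME.matcher(disposition);
        if (!m.find()) {
            return null;
        }
        // group(2) is set only for the quoted form; otherwise use the bare token.
        return m.group(2) != null ? m.group(2) : m.group(1);
    }

    public static void main(String[] args) {
        System.out.println(fileNameFrom("attachement; filename=bigblacklist.tar.gz"));
    }
}
```

A null result signals that the caller should fall back to the URL, as the else branch above does.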

Take a look at the output from curl:

curl -s -D - 'http://urlblacklist.com/cgi-bin/commercialdownload.pl?type=download&file=bigblacklist' -o /dev/null

You will see this response:

HTTP/1.1 200 OK
Date: Sun, 06 Sep 2015 00:55:51 GMT
Server: Apache/2.4.10 (Debian)
Content-disposition: attachement; filename=bigblacklist.tar.gz
Content-length: 22840787
Content-Type: application/octet-stream

I guess that's how browsers get the filename and extension:

Content-disposition: attachement; filename=bigblacklist.tar.gz

Or to do it from Java:

    URL obj = new URL("http://urlblacklist.com/cgi-bin/commercialdownload.pl?type=download&file=bigblacklist");
    URLConnection conn = obj.openConnection();
    String disposition = conn.getHeaderField("Content-disposition");
    System.out.println(disposition);

NOTE: The server seems to block your IP after several attempts, so make sure to try this from a "clean" IP if you have already tried many times today.
