
Download multiple files from a website with Java

I am logging onto a secure site using a proxy and want to be able to download all the files and folders onto my local disk. This is what I have so far.

EDIT: Currently the code below will start at a given root directory and download all files in all subdirectories ... pretty cool :) but it doesn't duplicate the folder structure, which is what I need. Any help please? (One possible approach is sketched after the code below.)

First of all I take four arguments (so the program can be used from the command line on Linux):

1) URL of the directory I want to download
2) username for the secure login
3) password
4) directory where the files should be saved on my local disk

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.PasswordAuthentication;
import java.net.Proxy;
import java.net.URL;
import java.net.URLConnection;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ApacheUrl4 {

    // Entry point: validate the arguments before indexing into args, then start the download
    public static void main(String[] args) throws Exception {
        checkArguments(args);

        String url = args[0];
        String username = args[1];
        String password = args[2];
        String directory = args[3];

        ApacheUrl4 downloader = new ApacheUrl4();
        downloader.process(url, username, password, directory);
    }

    public void process(String url, String username1, String password1, String directory) throws Exception {

        final String username = username1;
        final char[] password = password1.toCharArray();
        Authenticator.setDefault(new Authenticator() {
            protected PasswordAuthentication getPasswordAuthentication() {
                return new PasswordAuthentication(username, password);
            }
        });

        // Proxy settings (replace with your proxy host and port)
        String proxyip = "000.000.000";
        int proxyport = 8080;
        Proxy proxy = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxyip, proxyport));

        // Open a connection to the directory listing through the proxy
        URL file = new URL(url);
        URLConnection connection = file.openConnection(proxy);
        int responseCode = ((HttpURLConnection) connection).getResponseCode();
        System.out.println("response code " + responseCode);

        if (responseCode == HttpURLConnection.HTTP_FORBIDDEN) {
            System.out.println("Invalid username or password");
            return;
        }
        if (responseCode != HttpURLConnection.HTTP_OK) {
            System.out.println("Unexpected response code: " + responseCode);
            return;
        }

        // Parse the listing page as XML; this only works if the server emits well-formed XHTML
        BufferedInputStream in = new BufferedInputStream(connection.getInputStream());
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        DocumentBuilder docBuilder = factory.newDocumentBuilder();
        Document doc = docBuilder.parse(in);

        // Each <li> entry in the listing is either a subdirectory (text ends with "/") or a file
        Element root = doc.getDocumentElement();
        NodeList nodeList = root.getElementsByTagName("li");

        for (int i = 0; i < nodeList.getLength(); i++) {
            Node childNode = nodeList.item(i);
            String entry = childNode.getTextContent();

            // Subdirectory: recurse into it
            if (entry.contains("/")) {
                process(url + entry, username, password1, directory);
            }

            // File: download it, skipping the ".." parent-directory link
            if (entry.contains(".") && !entry.contains("..")) {
                String fileUrl = (url + entry).replace(" ", "%20");
                URLConnection connection2 = new URL(fileUrl).openConnection(proxy);
                BufferedInputStream in2 = new BufferedInputStream(connection2.getInputStream());

                File f = new File(directory + entry);
                if (f.isDirectory()) {
                    in2.close();
                    continue;
                }

                // Copy the remote stream to the local file
                OutputStream out = new FileOutputStream(f);
                byte[] buf = new byte[1024];
                int len;
                while ((len = in2.read(buf)) > 0) {
                    out.write(buf, 0, len);
                }
                out.close();
                in2.close();
            }
        }
        in.close();
    }

    // Validate the command-line arguments supplied by the user
    private static void checkArguments(String[] args) {
        if (args.length != 4 || args[0].isEmpty()) {
            System.out.println("Please specify four arguments in the following format:\n" +
                    " URL USERNAME PASSWORD DIRECTORY\n" +
                    "e.g. \"java ApacheUrl4 http://www.google.com user_name password C:\\path/dir/\"");
            System.exit(1);
        }
    }
}
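Regarding the EDIT above: one way to duplicate the remote folder structure is to derive each file's path relative to the root URL and create the matching local directories with File.mkdirs() before opening the FileOutputStream. A minimal sketch under that assumption; localFileFor and rootUrl are hypothetical names, not from the code above, and the recursion would need to pass the original root URL through unchanged:

import java.io.File;

public class MirrorPath {
    // Map a remote URL onto the local target directory, creating any
    // intermediate folders so the remote structure is duplicated.
    // rootUrl is the URL the crawl started from; fileUrl is a file found under it.
    static File localFileFor(String rootUrl, String fileUrl, String targetDir) {
        // The path relative to the crawl root, e.g. "sub/dir/file.txt"
        String relative = fileUrl.substring(rootUrl.length()).replace("%20", " ");
        File local = new File(targetDir, relative);
        local.getParentFile().mkdirs(); // create missing parent folders
        return local;
    }

    public static void main(String[] args) {
        File f = localFileFor("http://host/root/", "http://host/root/a/b/c.txt", "C:\\downloads");
        System.out.println(f); // C:\downloads\a\b\c.txt, with a\b created on disk
    }
}

In process(), the download step would then become something like File f = localFileFor(rootUrl, fileUrl, directory) instead of new File(directory + entry).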

In order to download the files in the directory, you first need the directory listing. This is generated automatically by the server, if it is allowed. First, use your browser to check whether this is the case on this specific server.
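If you would rather check programmatically than in a browser, one option is to request the directory URL and inspect the status code and Content-Type: an auto-generated Apache index comes back as an ordinary HTML page. A minimal sketch, omitting the proxy and authentication from the question for brevity:

import java.net.HttpURLConnection;
import java.net.URL;

public class ListingCheck {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(args[0]).openConnection();
        conn.setRequestMethod("GET");
        // 200 with an HTML content type suggests the server returned a listing page
        System.out.println(conn.getResponseCode() + " " + conn.getContentType());
        conn.disconnect();
    }
}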

Then you will need to parse the listing page and download each URL. The bad news is that there is no standard for these pages. The good news is that most of the internet is hosted on Apache or IIS, so if you can handle those two, you've got a good part covered.

You could probably get away with just parsing the file as XML (XHTML) and using XPath to recover all the URLs.
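A minimal sketch of that approach, assuming the listing really is well-formed XHTML; real Apache/IIS listings often are not, in which case an HTML parser such as jsoup is the safer route. ListingParser is a hypothetical standalone class:

import java.io.InputStream;
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class ListingParser {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new URL(args[0]).openStream()) {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(in);
            // Select the href attribute of every anchor in the listing page
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList hrefs = (NodeList) xpath.evaluate("//a/@href", doc, XPathConstants.NODESET);
            for (int i = 0; i < hrefs.getLength(); i++) {
                System.out.println(hrefs.item(i).getNodeValue());
            }
        }
    }
}

Entries whose href ends with "/" are subdirectories to recurse into; everything else is a file to download.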
