Java: How to extract the links to images in an HTML file using the Jsoup library

I have an HTML file that I want to read using Jsoup and export the results to an Excel sheet. As part of that process, I want to extract the links (src) of all the images present in the HTML file.

Here is the code snippet I am using:

File myhtml = new File("D:\\Projects\\Java\\report.html");
// get the string from the file myhtml
String str = getFileString(myhtml);

// getting the links to the images as in the html file
Document doc = Jsoup.parseBodyFragment(str);
Elements media = doc.select("[src]");

// System.out.println(media.size());
for (Element imageLink : media) {
    if (imageLink.tagName().equals("img"))
        // storing the local link to image as global variable in imlink
        P1.imlink = imageLink.attr("src").toString();
    System.out.println(P1.imlink);
}

        }

I have two images in the HTML file that I want the links for. However, the code I have written shows the link to only the first image in the file. Please help me find the error in my code.

// DOM example using JTidy
import org.w3c.tidy.Tidy;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class demo {
    public static void main(String[] args) {
        try {
            URL page = new URL("http://www.southreels.com");
            List<String> srcs = new ArrayList<String>();

            // Parse the page into a W3C DOM and collect the src of every <img>
            try (InputStream input = page.openStream()) {
                Document document = new Tidy().parseDOM(input, null);
                NodeList imgs = document.getElementsByTagName("img");
                for (int i = 0; i < imgs.getLength(); i++) {
                    Node src = imgs.item(i).getAttributes().getNamedItem("src");
                    if (src != null) { // skip <img> tags without a src attribute
                        srcs.add(src.getNodeValue());
                    }
                }
            }

            String dir = System.getProperty("user.dir") + System.getProperty("file.separator");
            int i = 0;
            for (String src : srcs) {
                System.out.println(i + "  " + src);
                i++;
                // Resolve relative links against the page URL, then save the image as demo<i>.jpg
                URL server = new URL(page, src);
                try (InputStream is = server.openStream();
                     OutputStream os = new FileOutputStream(dir + "demo" + i + ".jpg")) {
                    byte[] buffer = new byte[1024];
                    int bytesRead;
                    while ((bytesRead = is.read(buffer)) != -1) {
                        os.write(buffer, 0, bytesRead);
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace(); // report failures instead of silently ignoring them
        }
    }
}
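
Image src values are often relative (for example images/a.png), and a relative value has to be resolved against the address of the page before the image can be downloaded. Here is a minimal sketch of that step using only java.net.URI. The class name SrcResolver, the method resolve, and the example.com addresses are placeholders for illustration and are not part of the answer above.

```java
import java.net.URI;

// Hypothetical helper class; the name is not from the original post.
final class SrcResolver {

    // Resolves an img src against the address of the page it came from.
    // An src that is already absolute is returned unchanged.
    static String resolve(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/gallery/index.html"; // placeholder address
        System.out.println(resolve(page, "images/a.png"));
        System.out.println(resolve(page, "/static/b.png"));
        System.out.println(resolve(page, "http://cdn.example.com/c.png"));
    }
}
```

Note that URI.create throws an IllegalArgumentException for src values containing illegal characters such as unescaped spaces, so real-world pages may need extra handling.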

Try this here:

File f = new File("D:\\Projects\\Java\\report.html");

Document doc = Jsoup.parse(f, null, ""); // set proper Charset (2nd param) and BaseUri (3rd param) here
Elements elements = doc.select("img[src]");

for( Element element : elements )
{
    // Do something with your links here ...
    System.out.println(element.attr("src"));
}

By the way, the problem may be the part where you store the link in a global variable: it is overwritten on every pass through the loop. A better solution is to store the links in a List, or to leave the loop after the first hit.
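
The List approach could look roughly like the following sketch, assuming jsoup is on the classpath. The class name ImageLinkCollector, the method extractImageLinks, and the sample markup are made up for illustration. They are not taken from the question.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper class; the name is not from the original post.
final class ImageLinkCollector {

    // Returns the src value of every <img> that has one, in document order.
    static List<String> extractImageLinks(String html) {
        Document doc = Jsoup.parse(html);
        List<String> links = new ArrayList<>();
        for (Element img : doc.select("img[src]")) {
            links.add(img.attr("src")); // append instead of overwriting one variable
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample markup standing in for report.html
        String html = "<html><body><img src=\"images/a.png\"><p>text</p>"
                + "<img src=\"images/b.png\"></body></html>";
        for (String link : extractImageLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Because every link is kept, the caller can write all of them to the Excel sheet afterwards rather than only whichever one was assigned last.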
