I have an HTML file that I want to read using Jsoup and export the results to an Excel sheet. As part of that process, I want to extract the links (the src attribute) of all the images in the HTML file.
Here's the code snippet I have used to do this:
File myhtml = new File("D:\\Projects\\Java\\report.html");
//get the string from the file myhtml
String str = getFileString(myhtml);
//getting the links to the images as in the html file
Document doc = Jsoup.parseBodyFragment(str);
Elements media = doc.select("[src]");
//System.out.println(media.size());
for(Element imageLink:media)
{
if(imageLink.tagName().equals("img"))
{
//storing the local link to image as global variable in imlink
P1.imlink = imageLink.attr("src").toString();
System.out.println(P1.imlink);
}
}
I have two images in the HTML file that I want the links for. However, the code I have written prints the link to only the first image in the file. Please help me find the error in my code!
// DOM example using JTidy
import org.w3c.tidy.*;
import java.io.*;
import java.net.*;
import org.w3c.dom.*;
import java.util.*;
public class demo
{
public static void main(String arg[])
{
try
{
InputStream input = new URL("http://www.southreels.com").openStream();
Document document = new Tidy().parseDOM(input, null);
NodeList imgs = document.getElementsByTagName("img");
List<String> srcs = new ArrayList<String>();
for (int i = 0; i < imgs.getLength(); i++) {
srcs.add(imgs.item(i).getAttributes().getNamedItem("src").getNodeValue());
}
int i=0;
for (String src: srcs) {
System.out.println(i+" "+src);
i++;
String file =System.getProperty("user.dir")+System.getProperty("file.separator");
URL server = new URL(src);
HttpURLConnection connection = (HttpURLConnection)server.openConnection();
InputStream is = connection.getInputStream();
OutputStream os = new FileOutputStream(file+"demo"+i+".jpg");
byte[] buffer = new byte[1024];
int byteReaded = is.read(buffer);
while(byteReaded != -1)
{
os.write(buffer,0,byteReaded);
byteReaded = is.read(buffer);
}
os.close();
is.close();
}
}
catch(Exception e)
{
// report the failure instead of silently swallowing it
e.printStackTrace();
}
}
}
Try this:
File f = new File("D:\\Projects\\Java\\report.html");
Document doc = Jsoup.parse(f, null, ""); // set proper Charset (2nd param) and BaseUri (3rd param) here
Elements elements = doc.select("img[src]");
for( Element element : elements )
{
// Do something with your links here ...
System.out.println(element.attr("src"));
}
By the way, your problem may be the part where you store the link in a global variable: it is overwritten every time you run through the loop. A better solution is to store the links in a List, or to leave the loop after the first hit.
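To illustrate the overwriting point, here is a minimal plain-Java sketch that does not use Jsoup at all. The class name ImageLinkCollector, the field lastLink (a stand-in for P1.imlink) and the sample src values are assumptions made up for the example; in the real program each value would come from element.attr("src").

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ImageLinkCollector {

    // Hypothetical stand-in for the question's global variable P1.imlink.
    static String lastLink;

    // Overwrites the same variable on every iteration,
    // so only the value from the final iteration survives the loop.
    static void storeInSingleVariable(List<String> srcValues) {
        for (String src : srcValues) {
            lastLink = src;
        }
    }

    // Appends every value to a list, so no link is lost.
    static List<String> storeInList(List<String> srcValues) {
        List<String> links = new ArrayList<>();
        for (String src : srcValues) {
            links.add(src);
        }
        return links;
    }

    public static void main(String[] args) {
        // Assumed sample values, not taken from the original report.html.
        List<String> srcValues = Arrays.asList("images/chart1.png", "images/chart2.png");

        storeInSingleVariable(srcValues);
        System.out.println("single variable: " + lastLink);
        System.out.println("list: " + storeInList(srcValues));
    }
}
```

With a list, the later export step can iterate over all collected links instead of reading one shared variable.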