
How to get specific data from an HTML file using Jsoup?

I have an HTML file from a local-language newspaper, and I want to collect all the words in it that are in the local language only.

Looking at the HTML, I see that all the local-language words are inside div elements of class field-content, so I selected those elements to get the data. The problem is that each div also contains nested elements, and the local-language words sit inside those.

<div class = "field-content"></div>

So how do I get only the local-language words from the HTML file?

URL of the site: http://www.andhrabhoomi.net/

My code:

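import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// The class name is illustrative; the original post showed only main().
public class NewsScraper {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://www.andhrabhoomi.net/").userAgent("Mozilla").get();
            String title = doc.title();

            System.out.println("title : " + title);

            // get all links
            // Elements links = doc.select("a[href]");

            Elements body = doc.select("div.field-content");

            for (Element link : body) {
                // prints the whole element, nested markup included
                System.out.println(link);

                // get the value from href attribute
                // System.out.println("\nlink : " + link.attr("href"));
                // System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            System.out.println("error\n");
        }
    }
}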

Not sure what you are after here, but if my guess is right this should help. If not, just say so and we'll go from there.

Change your selection so that it picks up every element with the class field-content. Then, to get rid of all the other HTML content, call text() on each element inside your System.out.println call, so that only the text is printed. See below.

Elements body = doc.getElementsByClass( "field-content" );

for( Element link : body )
{
    System.out.println( link.text() );
}
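
If the text returned by text() still mixes English with the local language, one possible follow-up is to keep only the words written in the local script. The sketch below uses only the Java standard library (Character.UnicodeScript) and assumes the newspaper's language is Telugu; that is an assumption, not something stated in the question. The class and method names (ScriptFilter, wordsInScript, isInScript) are made up for illustration.

```java
import java.lang.Character.UnicodeScript;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jsoup: filters plain text by Unicode script.
class ScriptFilter {

    // Returns the whitespace-separated words of text whose letters all belong to script.
    // Punctuation attached to a word is kept as-is.
    static List<String> wordsInScript(String text, UnicodeScript script) {
        List<String> result = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty() && isInScript(word, script)) {
                result.add(word);
            }
        }
        return result;
    }

    // True if the word has at least one letter and every letter is in the given script.
    // Digits, punctuation and combining marks are not letters, so they are ignored here.
    static boolean isInScript(String word, UnicodeScript script) {
        boolean sawLetter = false;
        for (int i = 0; i < word.length(); ) {
            int cp = word.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                if (UnicodeScript.of(cp) != script) {
                    return false;
                }
                sawLetter = true;
            }
        }
        return sawLetter;
    }

    public static void main(String[] args) {
        // In the scraper above, this string would come from link.text().
        String text = "Latest తెలుగు news వార్తలు 2024";
        // TELUGU is an assumption about the site's language.
        System.out.println(wordsInScript(text, UnicodeScript.TELUGU));
        // prints [తెలుగు, వార్తలు]
    }
}
```

Inside the loop, you would pass link.text() to wordsInScript instead of printing it directly. Swap UnicodeScript.TELUGU for another constant if the page uses a different script.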

The solution is:

public static void main(String[] args) {
    try {
        // Fetch the page as in the question
        Document doc = Jsoup.connect("http://www.andhrabhoomi.net/").userAgent("Mozilla").get();
        String title = doc.title();

        System.out.println("title : " + title);

        // Select only the <a> elements that are direct children of div.field-content
        Elements body = doc.select("div[class=\"field-content\"] > a");

        for (Element link : body) {

            System.out.println("---------------------------------------------------------------------------------------------------------------");
            System.out.println(link);

            // src of an <img> nested inside the link (empty string if there is none)
            Elements img = link.select("img");
            System.out.print("\nSrc Img : " + img.attr("src"));

            // href of the link itself
            System.out.println("\nHref : " + link.attr("href"));
            // System.out.println("text : " + link.text());
        }

    } catch (Exception e) {
        System.out.println("error\n");
    }
}

