How to get specific data from an HTML file using Jsoup?

I have an HTML file from a local-language newspaper, and I want to collect only the words in it that are written in the local language.

I noticed that all the local-language words in the HTML file sit under div elements of class field-content, so I selected those elements to get the data. However, each div also contains nested elements, and the local-language words are inside those nested elements.

<div class = "field-content"></div>

So how can I get only the local-language words from the HTML file?

URL of the site: http://www.andhrabhoomi.net/

My code:

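import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// class name is a placeholder; the original post showed only the main method
public class Scraper {
    public static void main(String[] args) {
        try {
            Document doc = Jsoup.connect("http://www.andhrabhoomi.net/").userAgent("Mozilla").get();
            String title = doc.title();

            System.out.println("title : " + title);

            // get all links
            //Elements links = doc.select("a[href]");

            // select every div with class field-content
            Elements body = doc.select("div.field-content");

            for (Element link : body) {
                // prints the whole element, nested HTML tags included
                System.out.println(link);

                // get the value from href attribute
                //System.out.println("\nlink : " + link.attr("href"));
                //System.out.println("text : " + link.text());
            }
        } catch (IOException e) {
            System.out.println("error: " + e.getMessage());
        }
    }
}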

I'm not sure what you are after here, but if my guess is right, this should help. If not, just say so and we'll go from there.

Change your selection so that it gets every element with the class field-content. Then, to get rid of all the other HTML content, call text() on each element inside your print statement, i.e. System.out.println( link.text() ); See below.

Elements body = doc.getElementsByClass( "field-content" );

for( Element link : body )
{
    System.out.println( link.text() );
}
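
If the goal is to keep only the local-language words, one option is to filter the plain text that text() returns by Unicode script. Below is a minimal sketch using only the Java standard library. It assumes the local language is Telugu (an assumption, the question does not name the language), and the class name LocalWordFilter and the method localWords are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LocalWordFilter {

    // Assumption: the local script is Telugu. Swap in another script name
    // (for example \p{IsDevanagari}) if the page uses a different language.
    private static final Pattern LOCAL_RUN = Pattern.compile("\\p{IsTelugu}+");

    /** Returns every maximal run of Telugu-script characters found in text. */
    static List<String> localWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = LOCAL_RUN.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // Sample input written with Unicode escapes so the file compiles
        // regardless of source encoding: "Hello <Telugu word> world 2024".
        String sample = "Hello \u0C24\u0C46\u0C32\u0C41\u0C17\u0C41 world 2024";
        // Only the Telugu run survives; Latin words and digits are dropped.
        for (String word : localWords(sample)) {
            System.out.println(word);
        }
    }
}
```

Inside the loop above you would pass link.text() to localWords instead of the sample string. Note that this treats any unbroken run of Telugu characters as a word, so a joiner character from another script in the middle of a word would split it in two.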

The solution is:

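import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// class name is a placeholder; the original post showed only part of the method
public class Scraper {
    public static void main(String[] args) {
        try {
            // same connection as in the question
            Document doc = Jsoup.connect("http://www.andhrabhoomi.net/").userAgent("Mozilla").get();
            String title = doc.title();

            System.out.println("title : " + title);

            //get all links
            //Elements links = doc.select("a[href]");
            //Elements body = doc.select("div.field-content");
            // select only <a> elements that are direct children of div.field-content
            Elements body = doc.select("div[class=\"field-content\"] > a");

            for (Element link : body) {
                System.out.println("---------------------------------------------------------------------------------------------------------------");
                System.out.println(link);

                // get the src attribute of the first <img> inside the link (empty string if none)
                Elements img = link.select("img");
                System.out.print("\nSrc Img : " + img.attr("src"));

                // get the value from href attribute of the link itself
                System.out.println("\nHref : " + link.attr("href"));
                //System.out.println("text : " + link.text());
            }
        } catch (Exception e) {
            System.out.println("error\n");
        }
    }
}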
