How do I extract specific data from an HTML file using Jsoup?

Question

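I have the HTML of a local newspaper's website, and I want to collect all the words in it that are written in the local language only.

Looking at the HTML, I noticed that all the local-language words sit under div elements with the class field-content, so I selected those elements to get the data. However, each div also contains other elements in which the local-language words appear:

<div class = "field-content"></div>

So how do I get only the local-language words from the HTML file?

The site's URL: http://www.andhrabhoomi.net/

My code: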

public static void main(String a[])
{
    Document doc;
    try {
        doc = Jsoup.connect("http://www.andhrabhoomi.net/").userAgent("Mozilla").get();
        String title = doc.title();

        System.out.println("title : " + title);

        // get all links
        //Elements links = doc.select("a[href]");

        Elements body = doc.select("div.field-content");

        for (Element link : body) {

            System.out.println(link);

            // get the value from href attribute
            //System.out.println("\nlink : " + link.attr("href"));
            //System.out.println("text : " + link.text());
        }

    } catch (IOException e) {
        System.out.println("error\n");
    }
}

Answer 1

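I'm not sure exactly what the situation is here, but if my guess is right, this should help. If not, say so and we'll go from there.

You'll want to change the selection so that it picks up anything with the class field-content. Then, to get rid of all the other HTML, call text() on each element when printing: System.out.println( link.text() ); See below.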

Elements body = doc.getElementsByClass( "field-content" );

for( Element link : body )
{
    System.out.println( link.text() );
}
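
Even after text() strips the markup, the strings can still mix local-language words with Latin words or digits. Below is a minimal sketch of one way to filter them using only the Java standard library. It assumes the local language is Telugu (the script the site appears to use); the class TeluguWordFilter and its methods are hypothetical names for illustration, not part of Jsoup. For another language, swap in the matching Character.UnicodeScript constant.

```java
import java.util.ArrayList;
import java.util.List;

public class TeluguWordFilter {

    // True when the word is non-empty and every code point is in the Telugu script.
    // Assumption: Telugu is the "local language" meant in the question.
    public static boolean isTeluguWord(String word) {
        if (word.isEmpty()) {
            return false;
        }
        return word.codePoints()
                .allMatch(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.TELUGU);
    }

    // Splits on anything that is not a letter or combining mark,
    // then keeps only the tokens written entirely in Telugu.
    public static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.split("[^\\p{L}\\p{M}]+")) {
            if (isTeluguWord(token)) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // In the scraper, this string would come from link.text().
        String sample = "Hello తెలుగు వార్తలు 2016 news";
        System.out.println(extractWords(sample));
    }
}
```

Inside the loop above, extractWords( link.text() ) would then return only the Telugu words for each element. Splitting on non-letter characters is a simplification: invisible joiner characters inside a word would also split it.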

Answer 2

The solution is:

public static void main(String[] args) {
    try {
        // Fetch the page the same way the question's code does
        Document doc = Jsoup.connect("http://www.andhrabhoomi.net/").userAgent("Mozilla").get();
        String title = doc.title();

        System.out.println("title : " + title);

        //get all links
        //Elements links = doc.select("a[href]");
        //Elements body = doc.select("div.field-content");
        // Select only <a> elements that are direct children of div.field-content
        Elements body = doc.select("div[class=\"field-content\"] > a");

        for (Element link : body) {

            System.out.println("---------------------------------------------------------------------------------------------------------------");
            System.out.println(link);

            // get the src of the first <img> inside the link (empty string if none)
            Elements img = link.select("img");
            System.out.print("\nSrc Img : " + img.attr("src"));

            // get the value from href attribute
            Elements tag_a = link.select("a");
            System.out.println("\nHref : " + tag_a.attr("href"));
            //System.out.println("text : " + tag_a.text());
        }

    } catch (Exception e) {
        System.out.println("error\n");
    }
}

Answer 1
1 Accepted 2016-03-15 17:00:25

Answer 2
0 2016-03-16 09:02:47
