Java URL library for grabbing lines from a website

Question

I want to be able to grab N lines (HTML text content that starts on new lines) from a specific URL, e.g. www.sitename.com, and store them as strings in an array.

Something like:

public void grabLines() {
    // create an instance of a class from the imported library
    // pass the site name into it
    // from the instance, call a method that grabs the lines on the site, passing in N as a parameter
    // the method returns an array/list of N Strings that I can access later
}

Is there a native Java library I can import to do this? Does it let me do what I want easily?

Thanks

Answer 1

Are you trying to make a screen scraper? You will be pulling the HTML source, as opposed to just the text you see in the browser. Also, if the website is dynamic, you won't be able to pull everything that you can see. If you just want the HTML, you can try something like this. I used it when I tried to build a Bloomberg screen scraper and then parse out the stray HTML tags.

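import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// Inside a method: print the raw HTML of the page, one line at a time.
try {
    URL bbg = new URL("http://www.bloomberg.com/markets/economic-calendar/");
    // try-with-resources closes the reader (and the underlying stream) when done
    try (BufferedReader r = new BufferedReader(new InputStreamReader(bbg.openStream()))) {
        String temp; // holds the current line; readLine() returns null at end of stream
        while ((temp = r.readLine()) != null) {
            System.out.println(temp);
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}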
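To get closer to the grabLines idea in the question, the same URL/BufferedReader approach can stop after N lines and return them as a list. This is only a minimal sketch: the class name LineGrabber, the method names, and the choice of UTF-8 as the page encoding are my assumptions, not something from the question or from any library. The address must include the protocol (for example "http://www.sitename.com"), and the lines returned are raw HTML source lines, not rendered text.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name is illustrative only.
class LineGrabber {

    // Reads at most n lines from any Reader and returns them in order.
    // Fewer than n lines are returned if the source ends early.
    static List<String> firstLines(Reader source, int n) throws IOException {
        List<String> lines = new ArrayList<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while (lines.size() < n && (line = reader.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }

    // Opens the URL and returns the first n lines of its raw HTML.
    // Assumption: the page is decoded as UTF-8.
    static List<String> grabLines(String address, int n) throws IOException {
        URL url = new URL(address);
        try (Reader in = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
            return firstLines(in, n);
        }
    }
}
```

A call such as LineGrabber.grabLines("http://www.sitename.com", 10) would then give you up to 10 strings. Splitting the reading logic into firstLines makes it possible to try it on a StringReader without touching the network.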

Answer 2

Apache HttpClient is an abstraction on top of the URL/Reader technique above, but it works similarly.

Answer 1: score 2, accepted, 2011-06-25 18:53:37
Answer 2: score 1, 2011-06-25 18:58:49
