Java URL library for grabbing lines on a website

I want to be able to grab N lines (HTML text content that starts on a new line) from a specific URL, e.g. www.sitename.com, and store them as strings in an array.

Something like:

public void grabLines() {
    // create an instance of a class from the imported library
    // pass the site name into it
    // from the instance, call a method for grabbing the lines on the site, passing in "N" as a parameter
    // the method returns an array/list of N Strings that I can access later
}

Is there a native Java library I can import to do this? Does it let me do what I want easily?

Thanks

Are you trying to make a screen scraper? You will be pulling the raw HTML, not just the text you see in the browser. Also, if the website is dynamic (for example, content filled in by scripts after the page loads), you won't be able to pull everything that you can see. If you just want the HTML, you can try something like this. I tried to build a Bloomberg screen scraper and then parse out the random HTML tags.

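import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

try {
    URL bbg = new URL("http://www.bloomberg.com/markets/economic-calendar/");
    // try-with-resources closes the stream even if reading fails
    try (BufferedReader r = new BufferedReader(new InputStreamReader(bbg.openStream()))) {
        String temp; // holds the current line of HTML
        // print the page one line at a time until the stream ends
        while ((temp = r.readLine()) != null) {
            System.out.println(temp);
        }
    }
} catch (Exception e) {
    e.printStackTrace();
}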
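To get closer to the `grabLines` idea in the question, the same URL / Reader approach can be wrapped in a method that stops after N lines and returns them as a list. The following is only a rough sketch: the class name `LineGrabber` and the methods `firstLines`, `grabLines`, and `stripTags` are made up for illustration and are not part of any library. It assumes the page is UTF-8 encoded, and the tag stripping is a crude regular expression, not a real HTML parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the names here are illustrative only.
final class LineGrabber {

    // Reads at most n lines from any Reader; returns fewer if the input ends early.
    static List<String> firstLines(Reader source, int n) throws IOException {
        List<String> lines = new ArrayList<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while (lines.size() < n && (line = reader.readLine()) != null) {
            lines.add(line);
        }
        return lines;
    }

    // Opens the URL and returns the first n lines of raw HTML.
    // Assumption: the page is encoded as UTF-8.
    static List<String> grabLines(String address, int n) throws IOException {
        URL url = new URL(address);
        try (Reader in = new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)) {
            return firstLines(in, n);
        }
    }

    // Crude tag removal with a regular expression; fine for a quick look,
    // but it does not handle scripts, comments, or malformed markup.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "").trim();
    }
}
```

A call such as `LineGrabber.grabLines("http://www.sitename.com", 10)` would then return up to ten lines of raw HTML, which could each be passed through `stripTags` if only the text is wanted. Keeping the line-reading logic in `firstLines` separate from the network call makes it possible to try it out on a plain `StringReader` first.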

Apache HttpClient is an abstraction on top of the URL / Reader technique above, but it works in a similar way.
