
Query URL and return the contents of a specific HTML ID

I am looking to write a Java app that queries multiple URLs (defined by a list of URIs) for their HTML source and returns the contents of the element with a given id on each page.

As an example, let's say one starts with a list of blog post URLs such as...

...now, if a sample page looks like the following...

<html>
<body>
    <div class="content">
        <h2 id="post_title">Post Title</h2>
        <p class="post_paragraph">Here is the content of my post.</p>
    </div>
</body>
</html>

How can I grab the contents of the element with the id "post_title" for each of my URLs, and print it to the console with the classic System.out.print(String s)?

Thanks for all input.

First, fetch the page behind each URL using Java's connection API:

http://download.oracle.com/javase/6/docs/api/java/net/URLConnection.html
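A minimal sketch of this first step, assuming the pages are UTF-8 encoded and small enough to hold in memory. The class name PageFetcher, the method name fetch, and the 10-second timeouts are illustrative choices, not part of the original answer:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

public class PageFetcher {

    /** Downloads the resource behind the URL and returns it as one string. */
    public static String fetch(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        // Illustrative timeouts so a slow server cannot block the loop forever.
        connection.setConnectTimeout(10_000);
        connection.setReadTimeout(10_000);

        StringBuilder source = new StringBuilder();
        // Assumption: the page is UTF-8. A real crawler would inspect the
        // Content-Type header (connection.getContentType()) instead.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                source.append(buffer, 0, read);
            }
        }
        return source.toString();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: pass the blog post URLs as command-line arguments.
        for (String address : args) {
            String html = fetch(URI.create(address).toURL());
            System.out.println(address + ": " + html.length() + " characters");
        }
    }
}
```

The string returned by fetch is the raw HTML source; it still has to be parsed before an element can be looked up by id.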

Then you will need to parse the HTML:

http://www.google.be/search?q=java+html+parser

And finally you will need to walk the parsed document structure (how you do that depends on the parser you choose) to find the element with the given id.

Java includes built-in support for parsing HTML. Take a look at javax.swing.text.html.HTMLEditorKit: http://download.oracle.com/javase/6/docs/api/javax/swing/text/html/HTMLEditorKit.html

A couple of examples of how to use it:

http://java.sun.com/products/jfc/tsc/articles/bookmarks/

http://www.java2s.com/Tutorial/Java/0120_Development/ParseHTML.htm
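As a rough illustration of the built-in parser applied to the question, here is one possible sketch that uses HTMLEditorKit.ParserCallback with ParserDelegator to collect the text inside the first element carrying a given id. The class name IdTextExtractor and the method name textOfId are made up for this example, and the Swing parser only understands fairly old HTML, so treat this as a starting point rather than a robust solution:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class IdTextExtractor {

    /** Returns the trimmed text inside the first element with the given id, or null if there is none. */
    public static String textOfId(Reader html, final String id) throws IOException {
        final StringBuilder text = new StringBuilder();
        final boolean[] found = {false};

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0;        // greater than 0 while inside the target element
            private boolean done = false; // true once the target element has been closed

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (done) {
                    return;
                }
                if (depth > 0) {
                    depth++; // nested tag inside the target element
                } else if (id.equals(attrs.getAttribute(HTML.Attribute.ID))) {
                    depth = 1;
                    found[0] = true;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (depth > 0 && --depth == 0) {
                    done = true; // the target element itself was just closed
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    text.append(data);
                }
            }
        };

        // The last argument tells the parser to ignore charset directives in the page.
        new ParserDelegator().parse(html, callback, true);
        return found[0] ? text.toString().trim() : null;
    }

    public static void main(String[] args) throws IOException {
        // The sample page from the question, inlined for demonstration.
        String page = "<html><body><div class=\"content\">"
                + "<h2 id=\"post_title\">Post Title</h2>"
                + "<p class=\"post_paragraph\">Here is the content of my post.</p>"
                + "</div></body></html>";
        System.out.print(textOfId(new StringReader(page), "post_title"));
    }
}
```

To run this against live pages, pass an InputStreamReader wrapped around URLConnection.getInputStream() instead of the StringReader, and call textOfId once per URL in the list.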
