Read an HTML table into Java

I need to pull data from an HTML page using Java code. The Java part is required.

The page I am trying to pull info from is http://www.weather.gov/data/obhistory/KMCI.html.

I need to create a list of HashMaps, or some kind of data object that I can reference in later code.

This is all I have so far:

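import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Download the page and collect it into a single string.
URL weatherDataKC = new URL("http://www.weather.gov/data/obhistory/KMCI.html");
StringBuilder buffer = new StringBuilder();

// Decode through a Reader instead of casting raw bytes to char
// (this assumes the page is served as UTF-8).
try (Reader in = new InputStreamReader(weatherDataKC.openStream(), StandardCharsets.UTF_8)) {
    int cnt;
    while ((cnt = in.read()) != -1) {
        buffer.append((char) cnt);
    }
}

System.out.print(buffer.toString());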

Any suggestions on where to start?

There is a nice HTML parser called Neko:

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.

More information here.

Use an HTML parser such as CyberNeko.

J2SE includes HTML parsing capabilities in the packages javax.swing.text.html and javax.swing.text.html.parser. HTMLEditorKit.ParserCallback receives events pushed by DocumentParser (which is best used through ParserDelegator). The framework is very similar to the SAX parsers for XML.
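As a rough sketch of the callback approach described above, the class below collects the text of every <td>/<th> cell row by row, then keys each data row by the cells of the first row, which gives something like the list of maps the question asks for. The class name TableRowCollector, the helper toMaps, and the sample HTML are made up for illustration, and the code assumes no nested tables and no colspan/rowspan.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: gathers table cell text from parser events.
class TableRowCollector extends HTMLEditorKit.ParserCallback {
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> currentRow;    // null while outside a <tr>
    private StringBuilder currentCell;  // null while outside a <td>/<th>

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TR) {
            currentRow = new ArrayList<>();
        } else if ((tag == HTML.Tag.TD || tag == HTML.Tag.TH) && currentRow != null) {
            currentCell = new StringBuilder();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Text may arrive in several chunks, so accumulate it per cell.
        if (currentCell != null) {
            currentCell.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if ((tag == HTML.Tag.TD || tag == HTML.Tag.TH) && currentCell != null) {
            currentRow.add(currentCell.toString().trim());
            currentCell = null;
        } else if (tag == HTML.Tag.TR && currentRow != null) {
            rows.add(currentRow);
            currentRow = null;
            currentCell = null;
        }
    }

    /** Parses the HTML and returns one list of cell strings per table row. */
    static List<List<String>> parse(Reader html) throws IOException {
        TableRowCollector collector = new TableRowCollector();
        // true = ignore any charset declared in the document; the Reader already decodes.
        new ParserDelegator().parse(html, collector, true);
        return collector.rows;
    }

    /** Treats the first row as column names and maps every later row by those names. */
    static List<Map<String, String>> toMaps(List<List<String>> rows) {
        List<Map<String, String>> records = new ArrayList<>();
        if (rows.isEmpty()) {
            return records;
        }
        List<String> header = rows.get(0);
        for (List<String> row : rows.subList(1, rows.size())) {
            Map<String, String> record = new LinkedHashMap<>();
            for (int i = 0; i < header.size() && i < row.size(); i++) {
                record.put(header.get(i), row.get(i));
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample data, not the real contents of the weather page.
        String sample = "<html><body><table>"
                + "<tr><th>Time</th><th>Temp</th></tr>"
                + "<tr><td>12:53</td><td>71</td></tr>"
                + "</table></body></html>";
        for (Map<String, String> record : toMaps(parse(new StringReader(sample)))) {
            System.out.println(record);
        }
    }
}
```

To try it on the live page, pass a Reader over the URL stream in place of the StringReader. Note that it collects rows from every table on the page, so you would still need to pick out the rows that belong to the table you want.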

Beware, there are some bugs. It does not handle bad HTML very well.


Dealing with colspan and rowspan is up to you.
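One possible way to deal with spans, once you have the cells and their attribute values, is to copy each spanning cell into every grid position it covers so that all rows end up with the same columns. This is only a sketch: SpanExpander and Cell are hypothetical names, the sample header is invented, and it assumes you have already read colspan and rowspan as integers from each tag's attributes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: flattens colspan/rowspan cells into a rectangular grid.
class SpanExpander {
    /** One parsed cell: its text plus its colspan/rowspan values (1 when absent). */
    static final class Cell {
        final String text;
        final int colspan;
        final int rowspan;

        Cell(String text, int colspan, int rowspan) {
            this.text = (text == null) ? "" : text;  // null marks "unoccupied" in the grid
            this.colspan = Math.max(1, colspan);
            this.rowspan = Math.max(1, rowspan);
        }
    }

    /** Copies each cell's text into every grid position that the cell covers. */
    static List<List<String>> expand(List<List<Cell>> rows) {
        List<List<String>> grid = new ArrayList<>();
        for (int r = 0; r < rows.size(); r++) {
            int c = 0;
            for (Cell cell : rows.get(r)) {
                // Skip positions already claimed by a rowspan from an earlier row.
                while (get(grid, r, c) != null) {
                    c++;
                }
                for (int dr = 0; dr < cell.rowspan; dr++) {
                    for (int dc = 0; dc < cell.colspan; dc++) {
                        set(grid, r + dr, c + dc, cell.text);
                    }
                }
                c += cell.colspan;
            }
        }
        return grid;
    }

    private static String get(List<List<String>> grid, int r, int c) {
        if (r >= grid.size() || c >= grid.get(r).size()) {
            return null;
        }
        return grid.get(r).get(c);
    }

    private static void set(List<List<String>> grid, int r, int c, String value) {
        while (grid.size() <= r) {
            grid.add(new ArrayList<>());
        }
        List<String> row = grid.get(r);
        while (row.size() <= c) {
            row.add(null);
        }
        row.set(c, value);
    }

    public static void main(String[] args) {
        // Invented two-row header followed by one data row.
        List<List<Cell>> rows = Arrays.asList(
                Arrays.asList(new Cell("Date", 1, 2), new Cell("Temperature", 2, 1)),
                Arrays.asList(new Cell("Air", 1, 1), new Cell("Dwpt", 1, 1)),
                Arrays.asList(new Cell("09", 1, 1), new Cell("71", 1, 1), new Cell("60", 1, 1)));
        for (List<String> row : expand(rows)) {
            System.out.println(row);
        }
        // prints:
        // [Date, Temperature, Temperature]
        // [Date, Air, Dwpt]
        // [09, 71, 60]
    }
}
```

Repeating the text of a spanning cell is a design choice; you could just as well fill the covered positions with empty strings or join the two header rows into names like "Temperature Air".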

HTML scraping is notoriously difficult unless you have a lot of "hooks" like unique IDs. For example, the table you want starts with this HTML:

<table cellspacing="3" cellpadding="2" border="0" width="670">

...which is very generic and may match several tables on the page. The other problem is, what happens if the HTML structure changes? You'll have to redefine all your parsing rules...
