Some help scraping a page in Java
I need to scrape a web page using Java, and I've read that regex is a pretty inefficient way of doing it and that one should load the page into a DOM Document and navigate that instead.
I've tried reading the documentation, but it is too extensive and I don't know where to begin.
Could you show me how to scrape this table into an array? I can try figuring out my way from there. A snippet/example would do just fine too.
Thanks.
You can try jsoup: Java HTML Parser. It is an excellent library with good sample code.
Here is a working example using JTidy and the web page you provided. It extracts all the file names from the table.
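import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class JTidyExample {
    public static void main(String[] args) throws Exception {
        // Create a new JTidy instance and ask it to produce XHTML
        Tidy tidy = new Tidy();
        tidy.setXHTML(true);
        // Parse the HTML page into a DOM document (the tidied markup is echoed to System.out)
        URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");
        Document doc = tidy.parseDOM(url.openStream(), System.out);
        // Use XPath to select the text of every link inside a top-aligned cell
        XPath xpath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xpath.compile("//td[@valign = 'top']/a/text()");
        NodeList nodes = (NodeList) expr.evaluate(doc, XPathConstants.NODESET);
        // Copy the matched text nodes into a list
        List<String> filenames = new ArrayList<String>();
        for (int i = 0; i < nodes.getLength(); i++) {
            filenames.add(nodes.item(i).getNodeValue());
        }
        System.out.println(filenames);
    }
}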
The result will be
[Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:]
as expected.
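Since the question asked for the table as an array, here is a rough sketch of the same DOM + XPath idea that collects every row into a String[][]. It uses only the JDK's own XML parser, so it assumes the markup is already well-formed XHTML (for a real page you would still run it through JTidy or jsoup first). The class name TableToArray, the SAMPLE markup and the file names in it are made up for illustration; they are not taken from the page in the question.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class TableToArray {
    // Hypothetical, already well-formed sample markup (not the real page)
    static final String SAMPLE =
        "<table>"
        + "<tr><th>Lab</th><th>File</th></tr>"
        + "<tr><td valign=\"top\"><a href=\"integer.html\">Integer Processing:</a></td>"
        + "<td>integer.html</td></tr>"
        + "<tr><td valign=\"top\"><a href=\"image.html\">Image Processing:</a></td>"
        + "<td>image.html</td></tr>"
        + "</table>";

    // Returns one String[] per <tr>, holding the text of each <td>/<th> in it
    static String[][] tableToArray(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList rows = (NodeList) xpath.evaluate("//tr", doc, XPathConstants.NODESET);
            String[][] table = new String[rows.getLength()][];
            for (int r = 0; r < rows.getLength(); r++) {
                // Relative XPath: the cells that are direct children of this row
                NodeList cells = (NodeList) xpath.evaluate(
                    "td|th", rows.item(r), XPathConstants.NODESET);
                table[r] = new String[cells.getLength()];
                for (int c = 0; c < cells.getLength(); c++) {
                    // getTextContent() flattens nested tags such as <a>
                    table[r][c] = cells.item(c).getTextContent().trim();
                }
            }
            return table;
        } catch (ParserConfigurationException | SAXException
                 | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }

    public static void main(String[] args) {
        for (String[] row : tableToArray(SAMPLE)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

To use it with the JTidy example, you could skip the string parsing and run the same two XPath queries against the Document that tidy.parseDOM returns.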
Another cool tool that you can use is Web Harvest. It basically does everything I did above, but uses an XML file to configure the extraction pipeline.
Regex is definitely the way to go. Building a DOM is overly complicated and itself requires a lot of text parsing.
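For what it's worth, a regex version could look roughly like the sketch below. It is only an illustration under some assumptions: the table is flat (no nested tables), every tr/td/th has an explicit closing tag, and the class name RegexTableScraper is made up. Markup that breaks those assumptions will silently give wrong results, which is the usual argument for a real parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexTableScraper {
    // \b keeps <tr from matching <track> and <th from matching <thead>
    private static final Pattern ROW = Pattern.compile(
        "<tr\\b[^>]*>(.*?)</tr>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern CELL = Pattern.compile(
        "<t[dh]\\b[^>]*>(.*?)</t[dh]>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Returns one list per row, holding the tag-stripped text of each cell
    static List<List<String>> scrape(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                // Drop inner markup such as <a href="..."> and keep the text
                cells.add(TAG.matcher(cell.group(1)).replaceAll("").trim());
            }
            rows.add(cells);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not the page from the question
        String html = "<table><tr><th>Lab</th><th>File</th></tr>"
            + "<tr><td valign=\"top\"><a href=\"a.html\">Integer Processing:</a></td>"
            + "<td>a.html</td></tr></table>";
        System.out.println(scrape(html));
    }
}
```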
If all you are doing is scraping a table into a data file, regex will be just fine, and may even be better than using a DOM document. DOM documents use up a lot of memory (especially for really large data tables), so you probably want a SAX parser for large documents.
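To give an idea of the SAX approach, here is a minimal sketch that streams through the markup and only keeps the cell text, never the whole tree. It uses the JDK's built-in SAX parser, so it assumes well-formed XHTML without HTML-only entities such as &nbsp; (real-world HTML would need tidying first). The class name SaxTableScraper and the sample markup are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class SaxTableScraper {
    // Returns one list per <tr>, holding the text of each <td>/<th>
    static List<List<String>> scrape(String xhtml) {
        List<List<String>> rows = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private List<String> row;   // row being built, or null outside <tr>
            private StringBuilder cell; // cell text being built, or null outside a cell

            private boolean isCell(String name) {
                return name.equalsIgnoreCase("td") || name.equalsIgnoreCase("th");
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (qName.equalsIgnoreCase("tr")) {
                    row = new ArrayList<>();
                } else if (isCell(qName)) {
                    cell = new StringBuilder();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Text may arrive in several chunks, so append rather than assign
                if (cell != null) {
                    cell.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if (isCell(qName)) {
                    if (row != null && cell != null) {
                        row.add(cell.toString().trim());
                    }
                    cell = null;
                } else if (qName.equalsIgnoreCase("tr") && row != null) {
                    rows.add(row);
                    row = null;
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup, not the page from the question
        String xhtml = "<table><tr><th>Lab</th><th>File</th></tr>"
            + "<tr><td valign=\"top\"><a href=\"a.html\">Integer Processing:</a></td>"
            + "<td>a.html</td></tr></table>";
        System.out.println(scrape(xhtml));
    }
}
```

For a genuinely huge table you would write each finished row straight to the data file in endElement instead of collecting rows in a list, otherwise the memory advantage over DOM mostly disappears.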