
Some help scraping a page in Java

I need to scrape a web page using Java, and I've read that regex is a pretty inefficient way of doing it and that one should instead parse the page into a DOM Document and navigate that.

I've tried reading the documentation, but it seems too extensive and I don't know where to begin.

Could you show me how to scrape this table into an array? I can try figuring out my way from there. A snippet/example would do just fine too.

Thanks.

You can try jsoup: Java HTML Parser. It is an excellent library with good sample code.
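
For reference, a minimal sketch of that approach, assuming jsoup's standard connect/select API and the td valign="top" cell structure that the XPath example further down targets (the selector and class name here are illustrative, not from the original answer):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupTableScraper {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page; jsoup handles malformed HTML itself,
        // so no separate tidy step is needed
        Document doc = Jsoup.connect(
                "http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html").get();

        // Collect the link text from each table cell
        List<String> filenames = new ArrayList<String>();
        for (Element link : doc.select("td[valign=top] a")) {
            filenames.add(link.text());
        }

        System.out.println(filenames);
    }
}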

  1. Transform the web page you are trying to scrape into an XHTML document. There are several options to do this with Java, such as JTidy and HTMLCleaner. These tools will also automatically fix malformed HTML (e.g., close unclosed tags). Both work very well, but I prefer JTidy because it integrates better with Java's DOM API;
  2. Extract the required information using XPath expressions.

Here is a working example using JTidy and the web page you provided, used to extract all file names from the table.

import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public static void main(String[] args) throws Exception {
    // Create a new JTidy instance and set options
    Tidy tidy = new Tidy();
    tidy.setXHTML(true); 

    // Parse an HTML page into a DOM document
    URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");        
    Document doc = tidy.parseDOM(url.openStream(), System.out);

    // Use XPath to obtain whatever you want from the (X)HTML
    XPath xpath = XPathFactory.newInstance().newXPath();
    XPathExpression expr = xpath.compile("//td[@valign = 'top']/a/text()");
    NodeList nodes = (NodeList)expr.evaluate(doc, XPathConstants.NODESET);
    List<String> filenames = new ArrayList<String>();
    for (int i = 0; i < nodes.getLength(); i++) {
        filenames.add(nodes.item(i).getNodeValue()); 
    }

    System.out.println(filenames);
}

The result will be [Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:], as expected.

Another cool tool that you can use is Web Harvest. It basically does everything I did above, but using an XML file to configure the extraction pipeline.

Regex is definitely the way to go. Building a DOM is overly complicated and itself requires a lot of text parsing.

If all you are doing is scraping a table into a datafile, regex will be just fine, and may even be better than using a DOM document. DOM documents will use up a lot of memory (especially for really large data tables), so you probably want a SAX parser for large documents.
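
For what it's worth, a rough sketch of the regex route for this particular table, assuming each cell follows the <td valign="top"><a href="...">name</a> pattern that the XPath in the other answer relies on (the pattern and class name are illustrative and will break if the markup changes):

import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTableScraper {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");

        // Read the whole page into a single string
        String html = new Scanner(url.openStream(), "UTF-8").useDelimiter("\\A").next();

        // Capture the link text inside each matching table cell
        Pattern p = Pattern.compile("<td valign=\"top\">\\s*<a[^>]*>([^<]+)</a>");
        Matcher m = p.matcher(html);

        List<String> filenames = new ArrayList<String>();
        while (m.find()) {
            filenames.add(m.group(1));
        }

        System.out.println(filenames);
    }
}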
