
Some help scraping a page in Java

I need to scrape a web page using Java, and I've read that regex is a pretty inefficient way of doing it and that one should instead parse the page into a DOM Document and navigate that.

I've tried reading the documentation, but it seems too extensive and I don't know where to begin.

Could you show me how to scrape this table into an array? I can try figuring out my way from there. A snippet/example would do just fine too.

Thanks.

You can try jsoup: Java HTML Parser. It is an excellent library with good sample code.
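
For reference, a minimal sketch of that approach, assuming jsoup's standard connect/select API and the td valign="top" cell structure that the XPath example further down targets (the selector and class name here are illustrative, not from the original answer):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class JsoupTableScraper {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page; jsoup handles malformed HTML itself,
        // so no separate tidy step is needed
        Document doc = Jsoup.connect(
                "http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html").get();

        // Collect the link text from each table cell
        List<String> filenames = new ArrayList<String>();
        for (Element link : doc.select("td[valign=top] a")) {
            filenames.add(link.text());
        }

        System.out.println(filenames);
    }
}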

  1. Transform the web page you are trying to scrape into an XHTML document. There are several options to do this with Java, such as JTidy and HTMLCleaner. These tools will also automatically fix malformed HTML (e.g., close unclosed tags). Both work very well, but I prefer JTidy because it integrates better with Java's DOM API;
  2. Extract the required information using XPath expressions.

Here is a working example using JTidy and the web page you provided, used to extract all file names from the table.

import java.net.URL;
import java.util.ArrayList;
import java.util.List;

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public static void main(String[] args) throws Exception {
    // Create a new JTidy instance and set options
    Tidy tidy = new Tidy();
    tidy.setXHTML(true); 

    // Parse an HTML page into a DOM document
    URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");        
    Document doc = tidy.parseDOM(url.openStream(), System.out);

    // Use XPath to obtain whatever you want from the (X)HTML
    XPath xpath = XPathFactory.newInstance().newXPath();
    XPathExpression expr = xpath.compile("//td[@valign = 'top']/a/text()");
    NodeList nodes = (NodeList)expr.evaluate(doc, XPathConstants.NODESET);
    List<String> filenames = new ArrayList<String>();
    for (int i = 0; i < nodes.getLength(); i++) {
        filenames.add(nodes.item(i).getNodeValue()); 
    }

    System.out.println(filenames);
}

The result will be [Integer Processing:, Image Processing:, A Photo Album:, Run-time Experiments:, More Run-time Experiments:], as expected.

Another cool tool that you can use is Web Harvest. It basically does everything I did above, but using an XML file to configure the extraction pipeline.

Regex is definitely the way to go. Building a DOM is overly complicated and itself requires a lot of text parsing.

If all you are doing is scraping a table into a datafile, regex will be just fine, and may even be better than using a DOM document. DOM documents will use up a lot of memory (especially for really large data tables), so you probably want a SAX parser for large documents.
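
For what it's worth, a rough sketch of the regex route for this particular table, assuming each cell follows the <td valign="top"><a href="...">name</a> pattern that the XPath in the other answer relies on (the pattern and class name are illustrative and will break if the markup changes):

import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTableScraper {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.cs.grinnell.edu/~walker/fluency-book/labs/sample-table.html");

        // Read the whole page into a single string
        String html = new Scanner(url.openStream(), "UTF-8").useDelimiter("\\A").next();

        // Capture the link text inside each matching table cell
        Pattern p = Pattern.compile("<td valign=\"top\">\\s*<a[^>]*>([^<]+)</a>");
        Matcher m = p.matcher(html);

        List<String> filenames = new ArrayList<String>();
        while (m.find()) {
            filenames.add(m.group(1));
        }

        System.out.println(filenames);
    }
}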
