
What's the best way to extract table content from a group of HTML files?

Original post: 2008-09-16 01:53:46 9 6 java/ html/ excel/ csv/ extract

After using Tidy to clean up a folder full of HTML files, how can the table content be extracted for further processing?

6 answers

I've used BeautifulSoup for this sort of thing in the past with great success.

Depends on what sort of processing you want to do. You can tell Tidy to generate XHTML, which is a type of XML, which means you can use all the usual XML tools like XSLT and XQuery on the results.

If you want to process them in Microsoft Excel, you should be able to slice the table out of the HTML, put it in a file, and open that file in Excel: it will happily convert an HTML table into a spreadsheet page. You could then save it as CSV, as an Excel workbook, etc. (You can even use this on a web server: return an HTML table but set the Content-Type header to application/vnd.ms-excel, and Excel will open and import the table and turn it into a spreadsheet.)
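As a rough sketch of the web-server variant, here is a tiny WSGI application in Python that returns an HTML table labelled with the registered MIME type application/vnd.ms-excel. The sample `ROWS`, the helper `html_table`, the file name `report.xls`, and the Content-Disposition header are all assumptions added for illustration; only the Content-Type idea comes from the answer above.

```python
from html import escape

# Hypothetical sample data; in practice these rows would come from your own source.
ROWS = [("Name", "Qty"), ("Widget", 3)]

def html_table(rows):
    """Render rows as a bare HTML table, escaping each cell value."""
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table>{body}</table>"

def app(environ, start_response):
    """WSGI app: send an HTML table but label it as an Excel document."""
    payload = html_table(ROWS).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/vnd.ms-excel"),
        # Assumption: suggest a download name so the browser hands the file to Excel.
        ("Content-Disposition", 'attachment; filename="report.xls"'),
        ("Content-Length", str(len(payload))),
    ])
    return [payload]
```

To try it locally, `app` can be passed to `wsgiref.simple_server.make_server("localhost", 8000, app)` from the standard library. How Excel reacts to the file may vary between versions.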

If you want CSV to feed into a database, you could go via Excel as before. If you want to automate the process, you could write a program that uses the XML-navigating API of your choice to iterate over the table rows and save them as CSV. Python's ElementTree and csv modules would make this pretty easy.
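A minimal sketch of that ElementTree plus csv route might look like the following. It assumes Tidy was told to emit XHTML (so elements live in the XHTML namespace) and that no named entities such as `&nbsp;` remain, since ElementTree rejects those; Tidy's numeric-entities option is meant for that case. The function names and `SAMPLE` document are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tidy's XHTML output puts every element in this namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"

def table_rows(xhtml_text):
    """Yield each <tr> in the document as a list of cell strings."""
    root = ET.fromstring(xhtml_text)
    for tr in root.iter(XHTML_NS + "tr"):
        cells = [c for c in tr if c.tag in (XHTML_NS + "th", XHTML_NS + "td")]
        # itertext() also collects text inside nested inline elements such as <b>;
        # split/join collapses runs of whitespace.
        yield [" ".join("".join(c.itertext()).split()) for c in cells]

def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module handles quoting."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()

SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table>
  <tr><th>Name</th><th>Qty</th></tr>
  <tr><td>Widget, <b>large</b></td><td>3</td></tr>
</table>
</body></html>"""

if __name__ == "__main__":
    print(rows_to_csv(table_rows(SAMPLE)), end="")
```

For a whole folder, the same two functions could be called once per file, writing one CSV per input. Note that this sketch flattens all tables in a document into one stream of rows.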

After reviewing the suggestions, I wound up using HtmlUnit.

With HtmlUnit, I was able to customize the Java code to open each HTML file in the folder, navigate to the TABLE tag, query each column's content, and extract the data I needed to create a CSV file.

Iterate through the text and use regular expressions :)
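For what it's worth, a regex pass could look roughly like this in Python. It is only a sketch and it is fragile: it assumes well-formed, Tidy-cleaned markup with explicit closing tags and no nested tables, and the pattern and function names are made up for illustration.

```python
import re
from html import unescape

# Non-greedy matches between explicit opening and closing tags.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
CELL_RE = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")

def rows_from_text(html_text):
    """Return every table row found in the text as a list of cell strings."""
    rows = []
    for row_match in ROW_RE.finditer(html_text):
        cells = []
        for cell_match in CELL_RE.finditer(row_match.group(1)):
            text = TAG_RE.sub("", cell_match.group(1))   # drop inline markup
            cells.append(" ".join(unescape(text).split()))  # decode entities, tidy spaces
        rows.append(cells)
    return rows

SAMPLE = (
    "<table>\n<tr><th>Name</th><th>Qty</th></tr>\n"
    "<tr><td>Widget <b>large</b></td>\n<td>3</td></tr>\n</table>"
)
```

Anything less regular than this (nested tables, missing end tags) is likely to trip it up, which is where a real parser earns its keep.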

http://www.knowledgehouse.sg

In .NET you could use HTMLAgilityPack.

See this previous question on StackOverflow for more information.

If you want to extract the content from the HTML markup, you should use some type of HTML parser. There are plenty out there; here are two that might suit your needs:

http://jtidy.sourceforge.net/
http://htmlparser.sourceforge.net/
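To illustrate the parser-based approach in general terms, here is a small sketch using Python's standard-library `html.parser` rather than either of the Java libraries linked above, so it says nothing about their APIs. The class and function names are hypothetical, and nested tables are not handled.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every table in a document as a list of rows of cell text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []     # one list of rows per <table> seen so far
        self._cell = None    # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self.tables[-1].append([])
        elif tag in ("td", "th") and self.tables and self.tables[-1]:
            self._cell = []

    def handle_data(self, data):
        # Only text that appears inside an open cell is kept.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())
            self.tables[-1][-1].append(text)
            self._cell = None

def extract_tables(html_text):
    """Parse the text and return all tables found in it."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.tables

SAMPLE = (
    "<table><tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Widget &amp; <b>Co</b></td><td> 3 </td></tr></table>"
)
```

Calling `extract_tables(SAMPLE)` gives one table with a header row and one data row; each row can then be handed to whatever does the further processing.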