
What's the best way to extract table content from a group of HTML files?

After cleaning a folder full of HTML files with TIDY, how can the table content be extracted for further processing?

I've used BeautifulSoup for this kind of thing in the past with great success.

It depends on what sort of processing you want to do. You can tell Tidy to generate XHTML, which is a type of XML, which means you can use all the usual XML tools like XSLT and XQuery on the results.
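As a rough illustration of treating Tidy's XHTML as ordinary XML, here is a minimal Python sketch using the standard-library ElementTree instead of XSLT or XQuery. The sample markup and the `table_rows` helper are made up for the example. It assumes the file is well-formed XHTML in the usual XHTML namespace; named entities such as `&nbsp;` would probably need converting to numeric form first, because ElementTree rejects entities it does not know.

```python
import xml.etree.ElementTree as ET

# Namespace URI used by XHTML documents.
XHTML_URI = "http://www.w3.org/1999/xhtml"
XHTML_NS = {"x": XHTML_URI}
CELL_TAGS = {"{" + XHTML_URI + "}th", "{" + XHTML_URI + "}td"}

# Hypothetical Tidy output, trimmed down for illustration.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Report</title></head>
  <body>
    <table>
      <tr><th>Name</th><th>Qty</th></tr>
      <tr><td>Widget</td><td>4</td></tr>
    </table>
  </body>
</html>"""


def table_rows(xhtml_text):
    """Return every table in the document as a list of rows of cell text."""
    root = ET.fromstring(xhtml_text)
    tables = []
    for table in root.iterfind(".//x:table", XHTML_NS):
        rows = []
        # Note: ".//x:tr" also picks up rows of nested tables, if any.
        for tr in table.iterfind(".//x:tr", XHTML_NS):
            cells = [child for child in tr if child.tag in CELL_TAGS]
            rows.append(["".join(c.itertext()).strip() for c in cells])
        tables.append(rows)
    return tables


if __name__ == "__main__":
    for rows in table_rows(SAMPLE):
        print(rows)
```

The namespace map is the part that usually trips people up: without it, a plain `.//table` search finds nothing in an XHTML document.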

If you want to process them in Microsoft Excel, then you should be able to slice the table out of the HTML and put it in a file, then open that file in Excel: it will happily convert an HTML table into a spreadsheet page. You could then save it as CSV or as an Excel workbook etc. (You can even use this on a web server: return an HTML table but set the Content-Type header to application/vnd.ms-excel, and Excel will open and import the table and turn it into a spreadsheet.)
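For the web-server variant, a minimal sketch as a Python WSGI app might look like the following. The registered media type for Excel is `application/vnd.ms-excel`. The table markup, the `excel_table_app` name, and the `Content-Disposition` header (with its `table.xls` filename) are assumptions added for illustration, not part of the original suggestion.

```python
# Hypothetical table; in practice this would be sliced out of a tidied page.
TABLE_HTML = (
    "<table>"
    "<tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Widget</td><td>4</td></tr>"
    "</table>"
)


def excel_table_app(environ, start_response):
    """WSGI app that serves an HTML table labelled as an Excel document."""
    body = TABLE_HTML.encode("utf-8")
    headers = [
        ("Content-Type", "application/vnd.ms-excel"),
        # Assumption: suggest a download name so the browser hands it to Excel.
        ("Content-Disposition", 'attachment; filename="table.xls"'),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, something like `wsgiref.simple_server.make_server("", 8000, excel_table_app).serve_forever()` should be enough. Recent Excel versions may warn that the file content does not match its extension before opening it.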

If you want CSV to feed into a database then you could go via Excel as before, or if you want to automate the process, you could write a program that uses the XML-navigating API of your choice to iterate over the table rows and save them as CSV. Python's ElementTree and csv modules would make this pretty easy.
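A minimal sketch of that ElementTree-plus-csv idea, assuming the input is Tidy's XHTML output and that only the first table in the document matters. The `table_to_csv` name and the sample markup are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ElementTree spells namespaced tags as "{uri}name".
XHTML = "{http://www.w3.org/1999/xhtml}"

SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body><table>'
    "<tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Widget, large</td><td>4</td></tr>"
    "</table></body></html>"
)


def table_to_csv(xhtml_text, out):
    """Write the rows of the first table to `out` as CSV; return the row count."""
    root = ET.fromstring(xhtml_text)
    table = root.find(f".//{XHTML}table")
    if table is None:
        return 0
    writer = csv.writer(out)
    count = 0
    for tr in table.iter(f"{XHTML}tr"):
        cells = [c for c in tr if c.tag in (XHTML + "th", XHTML + "td")]
        # itertext() flattens nested markup such as <b> or <a> inside a cell.
        writer.writerow(["".join(c.itertext()).strip() for c in cells])
        count += 1
    return count


if __name__ == "__main__":
    buffer = io.StringIO()
    table_to_csv(SAMPLE, buffer)
    print(buffer.getvalue())
```

The csv module takes care of quoting, so a cell containing a comma (like "Widget, large" here) still round-trips into a database import.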

After reviewing the suggestions, I wound up using HtmlUnit.

With HtmlUnit, I was able to customize the Java code to open each HTML file in the folder, navigate to the TABLE tag, query each column content and extract the data I needed to create a CSV file.
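The HtmlUnit code itself isn't shown, so the following is not that Java code. It is only a sketch of the same workflow (open every file in a folder, find the table, pick some columns, write a CSV) in Python's standard library. It assumes the files are Tidy-generated XHTML with a `.html` extension; the function names and the default column indexes are made up for the example.

```python
import csv
import xml.etree.ElementTree as ET
from pathlib import Path

XHTML = "{http://www.w3.org/1999/xhtml}"


def extract_columns(html_path, columns):
    """Yield the chosen columns of each data row of the first table in a file."""
    root = ET.parse(html_path).getroot()
    table = root.find(f".//{XHTML}table")
    if table is None:
        return
    for tr in table.iter(f"{XHTML}tr"):
        cells = ["".join(td.itertext()).strip() for td in tr.findall(f"{XHTML}td")]
        # Header rows (only <th>) and short rows are skipped.
        if len(cells) > max(columns):
            yield [cells[i] for i in columns]


def folder_to_csv(folder, csv_path, columns=(0, 1)):
    """Collect the chosen columns from every *.html file into one CSV file."""
    written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for html_path in sorted(Path(folder).glob("*.html")):
            for row in extract_columns(html_path, columns):
                # Prefix each row with its source file name.
                writer.writerow([html_path.name, *row])
                written += 1
    return written
```

Calling `folder_to_csv("tidied_pages", "tables.csv", columns=(0, 2))` (both paths hypothetical) would then produce one CSV with a source-file column followed by the selected cells.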

Iterate through the text and use regular expressions :)
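For what it's worth, a regex pass can be sketched like this in Python. It assumes flat tables whose rows and cells are explicitly closed (which Tidy output should be) and it will likely mis-handle nested tables, so treat it as a quick hack rather than a robust parser. The pattern and function names are made up for the example.

```python
import re
from html import unescape

# \b keeps <tr from matching <track> and <th from matching <thead>.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.IGNORECASE | re.DOTALL)
CELL_RE = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def rows_by_regex(html_text):
    """Return table rows as lists of cell text using regular expressions only."""
    rows = []
    for row_match in ROW_RE.finditer(html_text):
        cells = CELL_RE.findall(row_match.group(1))
        # Drop any inline tags, then decode entities such as &amp;.
        rows.append([unescape(TAG_RE.sub("", cell)).strip() for cell in cells])
    return rows
```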

http://www.knowledgehouse.sg

In .NET you could use HTMLAgilityPack.

See this previous question on StackOverflow for more information.

If you want to extract the content from the HTML markup, you should use some type of HTML parser. There are plenty out there; here are two that might suit your needs:

http://jtidy.sourceforge.net/
http://htmlparser.sourceforge.net/
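Both libraries above are Java and their APIs aren't shown here. To illustrate the general parser-based approach, here is a small sketch using Python's standard-library `html.parser` instead, which does not require the input to be well-formed XML. The `TableParser` class is hypothetical, and it assumes cells are explicitly closed and tables are not nested.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of every table row as a list of cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []
        self._row = None   # cells of the row being read, or None outside <tr>
        self._cell = None  # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


if __name__ == "__main__":
    parser = TableParser()
    parser.feed("<table><tr><td>Widget</td><td>4</td></tr></table>")
    parser.close()
    print(parser.rows)
```

The parser lowercases tag names, so `<TD>` and `<td>` are handled alike.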


