
What's the best way to extract table content from a group of HTML files?

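After cleaning up a folder full of HTML files with TIDY, how can the table content be extracted for further processing?

I've used BeautifulSoup for this sort of thing in the past with great success.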

Depends on what sort of processing you want to do. You can tell Tidy to generate XHTML, which is a type of XML, which means you can use all the usual XML tools like XSLT and XQuery on the results.

If you want to process them in Microsoft Excel, you should be able to slice the table out of the HTML, put it in a file, and open that file in Excel: it will happily convert an HTML table into a spreadsheet page. You could then save it as CSV, as an Excel workbook, etc. (You can even use this on a web server: return an HTML table but set the Content-Type header to application/vnd.ms-excel, and Excel will open and import the table and turn it into a spreadsheet.)
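For the web server variant, here is a minimal sketch as a plain WSGI callable, using only the standard library. The function name `table_as_excel`, the sample table, and the `report.xls` filename are made up for illustration, and the Content-Disposition header is my own addition to suggest a download name. The media type used is the registered one, application/vnd.ms-excel.

```python
# Hypothetical sample data; in practice this would be the table sliced
# out of one of the tidied HTML files.
HTML_TABLE = (
    "<table>"
    "<tr><th>Name</th><th>Qty</th></tr>"
    "<tr><td>Apple</td><td>3</td></tr>"
    "</table>"
)


def table_as_excel(environ, start_response):
    """WSGI app that serves an HTML table labelled as an Excel document."""
    body = HTML_TABLE.encode("utf-8")
    headers = [
        # The header that makes the browser hand the response to Excel.
        ("Content-Type", "application/vnd.ms-excel"),
        # Assumption: a suggested filename, not part of the original advice.
        ("Content-Disposition", 'attachment; filename="report.xls"'),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

To try it locally, you could pass `table_as_excel` to `wsgiref.simple_server.make_server("", 8000, table_as_excel)` and call `serve_forever()` on the result.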

If you want CSV to feed into a database, you could go via Excel as before. If you want to automate the process, you could write a program that uses the XML-navigating API of your choice to iterate over the table rows and save them as CSV. Python's ElementTree and csv modules would make this pretty easy.
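A rough sketch of the ElementTree plus csv route might look like the following. It assumes Tidy was asked for XHTML output, so elements sit in the XHTML namespace, and that the file contains no named entities such as `&nbsp;` that a plain XML parser would reject. The names `table_rows`, `xhtml_to_csv`, and `SAMPLE` are invented for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"
NS = {"x": XHTML}
CELL_TAGS = {"{%s}td" % XHTML, "{%s}th" % XHTML}


def table_rows(root):
    """Yield every table row in the document as a list of cell strings."""
    for tr in root.iterfind(".//x:tr", NS):
        cells = [child for child in tr if child.tag in CELL_TAGS]
        # itertext() gathers text from nested markup such as <b> or <a>;
        # split/join collapses runs of whitespace.
        yield [" ".join("".join(c.itertext()).split()) for c in cells]


def xhtml_to_csv(xhtml_text):
    """Return the rows of all tables in an XHTML string as CSV text."""
    root = ET.fromstring(xhtml_text)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(table_rows(root))
    return out.getvalue()


# Hypothetical input standing in for one tidied file.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml"><body>
<table>
<tr><th>Name</th><th>Qty</th></tr>
<tr><td>Apple, red</td><td>3</td></tr>
</table>
</body></html>"""
```

For a whole folder, you could loop over the files, call `ET.parse(path).getroot()` on each, and pass the root to `table_rows`. The csv module takes care of quoting cells that contain commas.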

After reviewing the suggestions, I wound up using HtmlUnit.

With HtmlUnit, I was able to customize the Java code to open each HTML file in the folder, navigate to the TABLE tag, query each column content and extract the data I needed to create a CSV file.

Iterate through the text and use regular expressions :)
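If you do go the regular expression route, a sketch like the one below can cope with simple, flat tables where every row and cell has an explicit closing tag, which tidied files should have. It will misread nested tables, and a real parser is generally the safer choice. The function name `rows_by_regex` and the patterns are my own, not from any library.

```python
import html
import re

# Non-greedy matches between explicit opening and closing tags.
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr\s*>", re.IGNORECASE | re.DOTALL)
CELL_RE = re.compile(r"<t[dh]\b[^>]*>(.*?)</t[dh]\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")


def rows_by_regex(markup):
    """Return table rows as lists of cell text; nested tables are not handled."""
    rows = []
    for row in ROW_RE.findall(markup):
        cells = []
        for cell in CELL_RE.findall(row):
            # Drop inline tags, decode entities, collapse whitespace.
            text = html.unescape(TAG_RE.sub("", cell))
            cells.append(" ".join(text.split()))
        rows.append(cells)
    return rows
```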


In .NET you could use HTMLAgilityPack.

See this previous question on StackOverflow for more information.

If you want to extract the content from the HTML markup, you should use some type of HTML parser. There are plenty out there; here are two that might suit your needs:

http://jtidy.sourceforge.net/
http://htmlparser.sourceforge.net/
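The two libraries above are Java. To show the same parser-based idea in a few lines, here is a sketch that swaps in Python's standard-library `html.parser` instead. The class `TableExtractor` and the function `extract_tables` are names I made up, and the sketch does not handle tables nested inside cells.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect tables as lists of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def _close_cell(self):
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row is not None and self.tables:
            self.tables[-1].append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._close_row()  # tolerate an omitted </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate an omitted </td> or </th>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_tables(markup):
    """Parse an HTML string and return its tables as nested lists."""
    parser = TableExtractor()
    parser.feed(markup)
    parser.close()
    return parser.tables
```

Unlike an XML parser, this does not require well-formed input, so it also works on files that Tidy left as plain HTML.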
