Convert Word doc to HTML programmatically in Java

I need to convert a Word document into HTML file(s) in Java. The function takes a Word document as input and outputs HTML file(s) based on the number of pages in the document, i.e. if the Word document has 3 pages, then 3 HTML files are generated, one per page, split at the required page breaks.

I searched for open source/non-commercial APIs that can convert doc to HTML, but found nothing. If anybody has done this kind of work before, please help.

Thanks

I recommend JODConverter. It leverages OpenOffice.org, which provides arguably the best import/export filters for OpenDocument and Microsoft Office formats available today.

JODConverter has plenty of documentation, scripts, and tutorials to help you out.

I've used the following approach successfully in production systems where the new MS Word XML format isn't available:

Spawn a process that does something similar to:

http://www.oooninja.com/2008/02/batch-command-line-file-conversion-with.html

You'd probably want to start OpenOffice once at startup of your program, and call the Python script as many times as you need while the program runs (with some sort of check to ensure the ooffice process is always there).
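One simple way to do the "is ooffice still there" check, assuming OpenOffice was started as a service listening on a TCP port, is to try a connection before each conversion. This is only a sketch: the class name, host, port, and timeout are illustrative and depend on how you launched OpenOffice.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class OfficeHealthCheck {
    /** Returns true if something accepts TCP connections on host:port within the timeout. */
    static boolean isListening(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused or timed out: treat the service as down.
            return false;
        }
    }

    public static void main(String[] args) {
        // Hypothetical usage: host and port are whatever you started OpenOffice with.
        String host = args[0];
        int port = Integer.parseInt(args[1]);
        System.out.println(isListening(host, port, 2000) ? "up" : "down, restart ooffice");
    }
}
```

A successful connect only shows that the port is open, not that a conversion will succeed, so you would still want to handle failures of the conversion call itself.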

The other option is to spawn the following sort of command every time you need to do the conversion:

    ooffice -headless "macro://<path to ooffice vb macro to convert, with parameter pointing to file>"

I've used the macro approach multiple times and it works well (sorry, I don't have the macro code available).
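Spawning that command from Java could look roughly like the sketch below, using `ProcessBuilder` from the standard library. The class and method names are made up for illustration, and the macro URL is left as a parameter because it depends entirely on your own macro.

```java
import java.io.IOException;
import java.util.List;

class OfficeMacroRunner {
    /** Builds the argument list for: ooffice -headless "macro://..." */
    static List<String> buildCommand(String macroUrl) {
        if (!macroUrl.startsWith("macro://")) {
            throw new IllegalArgumentException("expected a macro:// URL: " + macroUrl);
        }
        // Each list element is passed as one argument, so no shell quoting is needed.
        return List.of("ooffice", "-headless", macroUrl);
    }

    /** Runs one conversion and returns the exit code of the spawned process. */
    static int convert(String macroUrl) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(macroUrl))
                .inheritIO() // forward the child's output to our own console
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the macro:// URL for your own conversion macro.
        System.exit(convert(args[0]));
    }
}
```

This assumes `ooffice` is on the PATH of the Java process; otherwise pass the full path to the executable as the first list element.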

While there are mechanisms for doing it via MS Word, they're not easy to use from Java, and they require other support programs to drive MS Word via OLE.

I've used AbiWord before too, which works well for many documents but gets confused by more complex ones (ooffice seems to handle everything I've thrown at it). AbiWord has a slightly easier command-line interface for conversion than ooffice.

We use tm-extractors ( http://mvnrepository.com/artifact/org.textmining/tm-extractors ), and fall back to the commercial Aspose ( http://www.aspose.com/ ). Both have native Java APIs.

It is easier to do this with the new MS Word docx format, because the format is XML. You can use XSL to transform the Word document's XML into HTML.
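As a rough illustration of the XSL idea using only the JDK: a docx is a zip archive whose main content lives in `word/document.xml`, so you can pull that entry out and run it through `javax.xml.transform`. The stylesheet below is a deliberately tiny assumption of mine (it only turns paragraphs into `<p>` elements and drops all formatting, tables, and images), and the class name is made up.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class DocxToHtmlXslt {
    // Minimal stylesheet: every w:p becomes a <p> holding the text of its w:t runs.
    static final String STYLESHEET =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
      + " xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main'"
      + " exclude-result-prefixes='w'>"
      + "<xsl:output method='html' indent='no'/>"
      + "<xsl:template match='/'>"
      + "<html><body><xsl:apply-templates select='//w:p'/></body></html>"
      + "</xsl:template>"
      + "<xsl:template match='w:p'>"
      + "<p><xsl:for-each select='.//w:t'><xsl:value-of select='.'/></xsl:for-each></p>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the XML of word/document.xml and returns the HTML. */
    static String toHtml(StreamSource documentXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter html = new StringWriter();
            transformer.transform(documentXml, new StreamResult(html));
            return html.toString();
        } catch (TransformerException e) {
            throw new IllegalArgumentException("could not transform document XML", e);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a .docx file.
        try (ZipFile docx = new ZipFile(args[0])) {
            ZipEntry entry = docx.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found, is this a docx?");
            }
            try (InputStream in = docx.getInputStream(entry)) {
                System.out.println(toHtml(new StreamSource(in)));
            }
        }
    }
}
```

A real stylesheet needs many more templates (runs with bold/italic, tables, hyperlinks, images stored under `word/media`), which is why the ready-made converters mentioned in this thread are usually the better starting point.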

If, however, your Word doc is in an older format, you can use the POI library ( http://poi.apache.org/ ) to read it into Java objects, and from that point on you can easily convert it to HTML using an HTML Java library:

http://www.dom4j.org/dom4j-1.4/apidocs/org/dom4j/io/HTMLWriter.html

I see this thread turns up in external links and gets the occasional post, so I thought I'd post an update (hope no one minds). OpenOffice continues to evolve, and release 3.2 improves the Word import/export filters again. OpenOffice and Java can run on many platforms, so Java systems can use the OpenOffice UNO API directly to import/manipulate/export documents in many formats (including Word and PDF), or use a library like JODReports or Docmosis to make this easier. Both have free/open options.

I tried this approach and it worked for me, from this site: http://code.google.com/p/xdocreport/wiki/XWPFConverterXHTML

It only works with docx; it converts the document into HTML, including the images inside the Word document.

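    // Package names as in xdocreport 1.x; they differ in later releases
    import java.io.*;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.apache.poi.xwpf.converter.core.FileImageExtractor;
    import org.apache.poi.xwpf.converter.core.FileURIResolver;
    import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
    import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;

    // 1) Load DOCX into XWPFDocument
    InputStream doc = new FileInputStream(new File("c:/document.docx"));
    XWPFDocument document = new XWPFDocument(doc);

    // 2) Prepare XHTML options
    XHTMLOptions options = XHTMLOptions.create();

    // 3) Extract the document's images into target/images/document
    String root = "target";
    File imageFolder = new File(root + "/images/document");
    options.setExtractor(new FileImageExtractor(imageFolder));

    // 4) URI resolver, so the generated HTML points at the extracted images
    options.URIResolver(new FileURIResolver(imageFolder));

    // 5) Convert, then release both streams
    OutputStream out = new FileOutputStream(new File("c:/document.html"));
    XHTMLConverter.getInstance().convert(document, out, options);
    out.close();
    doc.close();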

I hope this solves your issue.

If it's a docx, you could use docx4j (ASL v2). It uses XSLT to create the HTML.

However, it will give you a single HTML file for the whole document.

If you want one HTML file per page, you could do something with the lastRenderedPageBreak tag that Word puts into the docx (assuming you used Word to create it).
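To make the lastRenderedPageBreak idea concrete, here is a sketch that does not use docx4j at all: it parses `word/document.xml` with the JDK's DOM parser and starts a new "page" whenever a paragraph contains a `w:lastRenderedPageBreak` element. The class name is made up, and it simplifies in two ways: a paragraph that straddles a page break is assigned entirely to the new page, and only plain text is collected.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class PageSplitter {
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Groups paragraph texts into pages, one inner list per rendered page. */
    static List<List<String>> splitIntoPages(InputSource documentXml) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // needed for the w: namespace lookups
            doc = factory.newDocumentBuilder().parse(documentXml);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse document XML", e);
        }

        List<List<String>> pages = new ArrayList<>();
        List<String> current = new ArrayList<>();
        pages.add(current);
        NodeList paragraphs = doc.getElementsByTagNameNS(W, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Element p = (Element) paragraphs.item(i);
            boolean hasBreak = p.getElementsByTagNameNS(W, "lastRenderedPageBreak").getLength() > 0;
            if (hasBreak && !current.isEmpty()) {
                current = new ArrayList<>(); // Word rendered a page break here
                pages.add(current);
            }
            current.add(textOf(p));
        }
        return pages;
    }

    /** Concatenates the text of all w:t runs inside one paragraph. */
    static String textOf(Element paragraph) {
        StringBuilder text = new StringBuilder();
        NodeList runs = paragraph.getElementsByTagNameNS(W, "t");
        for (int i = 0; i < runs.getLength(); i++) {
            text.append(runs.item(i).getTextContent());
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // args[0] is the path to an extracted word/document.xml file.
        List<List<String>> pages = splitIntoPages(new InputSource(args[0]));
        System.out.println(pages.size() + " page(s) found");
    }
}
```

Each inner list could then be written to its own HTML file. Keep in mind that the tag only records where Word last broke the pages, so it can be stale or missing if the file was produced or edited by another tool.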

You'd have to find the MS Word doc specification (since it is basically a binary dump of whatever is in Word at that point in time) and slowly go through it element by element, converting MS Word "objects/states" to their HTML equivalents. You might be able to find a script that does it for you, since this really isn't fun work and I'd advise against it (converting file formats, or even reading commercial file formats on your own, is always hard and often incomplete). PS: just google doc2html

If you are targeting Word 2007 files using the OOXML format, then this article might help. There is also the Ooxml4j project, which is implementing OOXML as a Java library.

If you are targeting the binary files, though... that's another problem.

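    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import officetools.OfficeFile; // package available at www.dancrintea.ro/doc-to-pdf/
    ...
    FileInputStream fis = new FileInputStream(new File("test.doc"));
    FileOutputStream fos = new FileOutputStream(new File("test.html"));
    // "localhost" and "8100" presumably point at a running OpenOffice service
    OfficeFile f = new OfficeFile(fis,"localhost","8100", true);
    f.convert(fos,"html");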

All possible conversions:

doc --> pdf, html, txt, rtf

xls --> pdf, html, csv

ppt --> pdf, swf

html --> pdf

You can use Microsoft Office Online.

First, on the server side, request https://view.officeapps.live.com/op/view.aspx?src='your doc file online url'

Then use jsoup to parse the resulting HTML.

When accessed from a mobile device, the HTML will be wrapped in a frame.
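A sketch of the server-side request with the JDK's own HTTP client (Java 11+). The class and method names are made up; the one detail worth showing is that the document URL should be URL-encoded before it is appended as the `src` parameter. The jsoup step is only indicated in a comment, since it needs the jsoup library on the classpath.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class OfficeViewerClient {
    static final String VIEWER = "https://view.officeapps.live.com/op/view.aspx?src=";

    /** Builds the viewer URL for a document that is publicly reachable online. */
    static String viewerUrl(String publicDocUrl) {
        return VIEWER + URLEncoder.encode(publicDocUrl, StandardCharsets.UTF_8);
    }

    /** Fetches the viewer page for the document and returns the raw HTML. */
    static String fetchHtml(String publicDocUrl) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(viewerUrl(publicDocUrl)))
                .GET()
                .build();
        return HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString())
                .body();
    }

    public static void main(String[] args) throws Exception {
        // args[0] is the public URL of your doc file.
        String html = fetchHtml(args[0]);
        // With jsoup available, the next step would be: Jsoup.parse(html)
        System.out.println(html.length() + " characters of HTML received");
    }
}
```

The viewer has to be able to download the document itself, so this only works for files that are reachable from the public internet.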
