
How to use 'catdoc' to display a docx file encoded in UTF-8

I have a lot of docx files and I want to read them in the terminal. I found catdoc: http://www.wagner.pp.ru/~vitus/software/catdoc/

When I use it, the output is just unreadable characters. My docx files are encoded in UTF-8. I tried "catdoc -u my_file.docx", but it does not work.

Please help. Thank you very much.

docx files are zipped XML files.

To extract and strip the XML, try something based on

unzip -p "*.docx" word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

(from commandlinefu)
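
The one-liner above deletes every tag, so paragraphs run together and XML entities such as &amp; stay encoded. Here is a minimal sketch of a variant that keeps paragraph breaks, assuming each paragraph in word/document.xml ends with a closing w:p tag. The function name docx_xml_to_text is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads the contents of word/document.xml on stdin
# and writes plain text to stdout.
docx_xml_to_text() {
    # 1. turn each closing paragraph tag into a newline
    # 2. drop all remaining tags
    # 3. decode the five predefined XML entities (&amp; last, so that
    #    "&amp;lt;" becomes "&lt;" and not "<")
    sed -e 's/<\/w:p>/\
/g' \
        -e 's/<[^>]*>//g' \
        -e 's/&lt;/</g' \
        -e 's/&gt;/>/g' \
        -e 's/&quot;/"/g' \
        -e "s/&apos;/'/g" \
        -e 's/&amp;/\&/g'
}

# Usage (my_file.docx is a placeholder):
#   unzip -p my_file.docx word/document.xml | docx_xml_to_text | less
```

This still ignores tables, footnotes, and headers, which live in other parts of the archive, so treat it as a quick preview rather than a faithful conversion.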

It is my naïve understanding that catdoc can generally only be used on DOC files. DOCX files are something like a zipped container with a bunch of information in them, among which you can find the original document in some sort of XML format.
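
One way to tell the two formats apart before picking a tool is to look at the first four bytes of the file. The sketch below assumes the usual signatures: ZIP archives (and therefore DOCX) start with "PK" followed by bytes 03 04, while legacy DOC files are OLE2 compound files starting with D0 CF 11 E0. The function name office_kind is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "zip", "ole2" or "unknown" for the given file,
# based only on its first four bytes.
office_kind() {
    # Read 4 bytes and render them as lowercase hex without separators.
    magic=$(dd if="$1" bs=1 count=4 2>/dev/null | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        504b0304) echo zip ;;     # ZIP container, e.g. DOCX
        d0cf11e0) echo ole2 ;;    # OLE2 compound file, e.g. legacy DOC
        *)        echo unknown ;;
    esac
}

# Usage (file name is a placeholder):
#   office_kind my_file.docx
```

A result of "zip" only says the file is a ZIP archive, not that it is a valid DOCX, so it is a hint for choosing between catdoc and the tools below rather than a validation.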

Having said that, I have had pleasant success extracting the contents of DOCX files, or even DOTX files for that matter, using either the docx2txt tool or the unoconv tool; the latter needs the OpenOffice or LibreOffice suite installed.

Here are some example workflows, which I have used successfully in the past:

# This one, contrary to the unoconv case, does not fire up an instance
# of either LibreOffice or OpenOffice.
docx2txt.pl < ./pesky-word-doc.docx > ./pesky-word-doc.txt

# This one, however, does fire up a rather heavy 'headless' OpenOffice
# or LibreOffice instance process per conversion. You can get around this
# using the next approach below.
unoconv -f txt -o ./pesky-word-doc.txt ./pesky-word-doc.docx

# If you need to convert a couple dozen such documents, you might want
# to start one listener on a service port and reuse it (you get the idea):
unoconv --listener --port=2002 &
unoconv -f txt -o outdir *.docx
unoconv -f pdf -o outdir *.docx && open ./outdir/*.pdf # Convenient, if you run MacOSX
kill -15 %-

# Bringing catdoc back in: convert to DOC first, then run catdoc. The sed was
# needed for German documents, where I couldn't find the proper encoding
# settings and the umlauts came out as Cyrillic letters.
unoconv -f doc -o ./pesky-word-doc.doc ./pesky-word-doc.docx && \
          catdoc -u ./pesky-word-doc.doc | sed 's/ь/ü/g;s/д/ä/g;s/ц/ö/g'
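
For a whole directory of files, the docx2txt.pl call can be wrapped in a loop. This is only a sketch: it assumes docx2txt.pl is on the PATH and reads the document on standard input, as in the first example above. The names convert_all and DOCX_CONVERTER are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert_all OUTDIR FILE.docx...
# Writes OUTDIR/NAME.txt for every FILE NAME.docx given.
# DOCX_CONVERTER (an assumption, not a docx2txt feature) names a command
# that reads a DOCX on stdin and writes text to stdout.
convert_all() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for f in "$@"; do
        # Skip unmatched globs and anything that is not a regular file.
        [ -f "$f" ] || continue
        base=$(basename "$f" .docx)
        "${DOCX_CONVERTER:-docx2txt.pl}" < "$f" > "$outdir/$base.txt"
    done
}

# Usage:
#   convert_all ./txt ./*.docx && less ./txt/*.txt
```

Quoting "$f" and "$outdir/$base.txt" keeps file names with spaces intact, which is common for Word documents.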

There are other options, such as using one of the available Java parsers. The output quality differs between tools, so which approach you should choose depends on your intended usage.
