I am trying to parse an MS Office 2003 document using antiword on my Linux server, but it does not parse Cyrillic text correctly.
It returns something like this:
??? ???? ???????????
Does anybody know of a way to correctly parse an MS Office 2003 document that contains Cyrillic text?
I resolved this issue with Cyrillic text.
Good documentation is available here.
Working code follows:
$content = shell_exec('/usr/bin/antiword -m cp1251.txt ' . escapeshellarg($filename));
var_dump($content);
Pay attention to the -m parameter (character mapping file).
You forgot to set the correct mapping file.
The part of the documentation that concerns the mapping file:
Q9: Which mapping file (-m option) is correct in my situation?
A9: The correct mapping file depends on the character set you need for output
in a specific language.
For Western European languages (like English, French, German) this is
8859-1.txt. (OS/2: cp1252.txt) (DOS: cp850.txt)
For Eastern European languages (like Polish, Czech, Slovak, Croatian) this
is 8859-2.txt. (OS/2: cp1250.txt) (DOS: cp852.txt)
For Esperanto use 8859-3.txt.
For Russian use 8859-5.txt or koi8-r.txt. (OS/2: cp1251.txt)
(DOS: cp866.txt)
For Ukrainian use koi8-u.txt.
For Arabic use 8859-6.txt. (DOS: cp864.txt)
For Hebrew use 8859-8.txt. (DOS: cp862.txt)
For Thai use 8859-11.txt.
If your system supports it, you might also try UTF-8.txt.
NOTE: UTF-8 also enables Antiword to show text in languages like Chinese,
Japanese and Korean.
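Following the FAQ excerpt above, here is a minimal sketch of how the command could be assembled for any of the listed mapping files. The helper name `build_antiword_command` is hypothetical, the `/usr/bin/antiword` path is taken from the code above, and `UTF-8.txt` as the default is only an assumption based on the FAQ's suggestion.

```php
<?php
// Hypothetical helper (not part of antiword or PHP): builds the command line
// for a given document and mapping file.
// Assumes antiword lives at /usr/bin/antiword, as in the code above.
function build_antiword_command(string $filename, string $mapping = 'UTF-8.txt'): string
{
    // escapeshellarg() quotes each argument, so spaces or shell
    // metacharacters in the file name cannot break the command.
    return '/usr/bin/antiword -m ' . escapeshellarg($mapping)
        . ' ' . escapeshellarg($filename);
}

$command = build_antiword_command('test.doc', 'cp1251.txt');
echo $command, PHP_EOL;

// On a machine with antiword installed, the command could then be run:
// $content = shell_exec($command);
```

Which mapping file to pass depends on the encoding the rest of the page or script expects: cp1251.txt for a Windows-1251 page, UTF-8.txt for a UTF-8 page.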
Antiword has an encoding parameter; maybe give that a try:
shell_exec('antiword -X UTF-8 test.doc')
Or use koi8-r, and then convert in PHP via iconv().
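A rough sketch of the iconv() step, assuming the text came out of antiword in KOI8-R. The sample bytes below are a stand-in for real antiword output and are only there for illustration.

```php
<?php
// Sample bytes: the word "Привет" encoded in KOI8-R. This is an assumed
// stand-in for output produced with the koi8-r mapping file.
$koi8 = "\xF0\xD2\xC9\xD7\xC5\xD4";

// iconv(from_encoding, to_encoding, string) returns the converted string,
// or false if the conversion fails.
$utf8 = iconv('KOI8-R', 'UTF-8', $koi8);

if ($utf8 === false) {
    echo 'conversion failed', PHP_EOL;
} else {
    echo $utf8, PHP_EOL;
}
```

In real use, `$koi8` would be replaced by the string returned from shell_exec().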
Alternatively, try LibreOffice in command-line mode:
shell_exec('soffice --headless --convert-to txt test.doc')