
How to convert txt to pdf with utf-8?

I use the following command to convert txt to ps. Then convert ps to pdf.

enscript --header='Page $% of $=' --word-wrap -o output.ps 2>/dev/null < input.txt

But it does not work for utf-8 input.

enscript --header='Page $% of $=' --word-wrap -o output.ps 2>/dev/null <<< ℃

The above command results in â\204\203 in the output file.
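For what it's worth, that output is what you would expect if the three UTF-8 bytes of ℃ are each treated as a separate 8-bit character. The sketch below reproduces it, assuming enscript reads the input as Latin-1 and prints bytes with no printable Latin-1 character as octal escapes. I believe those are its defaults, but treat that as an assumption. `latin1_view` is a made-up helper name, not part of enscript.

```python
def latin1_view(data):
    """Render bytes the way a Latin-1, octal-escaping tool might show them."""
    out = []
    for b in data:
        if 0x20 <= b < 0x7F or b >= 0xA0:
            out.append(chr(b))          # printable in Latin-1
        else:
            out.append("\\%03o" % b)    # anything else as a 3-digit octal escape
    return "".join(out)

utf8_bytes = "℃".encode("utf-8")       # U+2103 encodes to three bytes
print(utf8_bytes.hex())                 # prints e28483
rendered = latin1_view(utf8_bytes)      # 0xE2 is a-circumflex; 0x84 and 0x83 become \204 and \203
```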

I see discussions saying that enscript does not support utf-8. There seem to be several alternatives that convert txt to pdf, but it is not clear which one is the most robust and convenient to use. Does anybody know the best solution to this problem?

Tackling this as a programming question, and not a request for software recommendation (which would be off-topic).

You can't use UTF-8, or at least not simply. PostScript does not support UTF-8 directly at all. However...

Since PostScript is a programming language, you could write a program which examines the first byte of each UTF-8 sequence to see whether it's a complete character code or a lead byte indicating that further bytes follow, essentially undoing the encoding to produce a Unicode code point.
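To make the decoding step concrete, here is a minimal sketch of that logic in Python rather than PostScript (the same bit tests would have to be written with PostScript's `and`, `or` and `bitshift` operators). `decode_utf8` is a hypothetical name. It rejects truncated sequences and bad continuation bytes, but does not check for overlong encodings or surrogates.

```python
def decode_utf8(data):
    """Undo UTF-8 by hand: return the list of Unicode code points in data."""
    code_points = []
    i = 0
    while i < len(data):
        first = data[i]
        if first < 0x80:                 # 0xxxxxxx: the byte is the code point
            cp, extra = first, 0
        elif first >> 5 == 0b110:        # 110xxxxx: one more byte follows
            cp, extra = first & 0x1F, 1
        elif first >> 4 == 0b1110:       # 1110xxxx: two more bytes follow
            cp, extra = first & 0x0F, 2
        elif first >> 3 == 0b11110:      # 11110xxx: three more bytes follow
            cp, extra = first & 0x07, 3
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + extra >= len(data):
            raise ValueError("truncated sequence at offset %d" % i)
        for b in data[i + 1:i + 1 + extra]:
            if b >> 6 != 0b10:           # continuation bytes look like 10xxxxxx
                raise ValueError("invalid continuation byte near offset %d" % i)
            cp = (cp << 6) | (b & 0x3F)  # append the low six bits
        code_points.append(cp)
        i += extra + 1
    return code_points

degree_celsius = decode_utf8(b"\xe2\x84\x83")   # the three bytes of U+2103
```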

From there, with a list of glyph names and Unicode code points, you could create a font with a custom Encoding, and instead of writing UTF-8 into the PostScript program, write the single byte which maps the character code through the Encoding to the relevant glyph name.
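A rough sketch of the custom Encoding idea, again in Python, generating the PostScript pieces as text. Everything named here is illustrative: the helper functions are made up, the `uniXXXX` glyph names follow the usual naming convention but only work if the chosen font really contains glyphs with those names, and Helvetica / Helvetica-Custom are placeholder font names. It keeps ASCII where it is and hands out codes 128 to 255 to the non-ASCII characters actually used.

```python
def assign_codes(text, first_code=128):
    """Give each distinct non-ASCII character a free single-byte code."""
    codes = {}
    for ch in text:
        if ord(ch) < 128 or ch in codes:
            continue
        if ord(ch) > 0xFFFF:
            raise ValueError("uniXXXX glyph names only cover U+0000..U+FFFF")
        code = first_code + len(codes)
        if code > 255:
            raise ValueError("ran out of single-byte codes")
        codes[ch] = code
    return codes

def reencode_program(codes, base_font="Helvetica", new_font="Helvetica-Custom"):
    """Emit PostScript that copies base_font and patches its Encoding."""
    lines = [
        "/%s findfont dup length dict begin" % base_font,
        "  { 1 index /FID ne { def } { pop pop } ifelse } forall",
        "  /Encoding Encoding 256 array copy def",
    ]
    for ch, code in sorted(codes.items(), key=lambda item: item[1]):
        lines.append("  Encoding %d /uni%04X put" % (code, ord(ch)))
    lines += ["  currentdict", "end", "/%s exch definefont pop" % new_font]
    return "\n".join(lines)

def ps_string(text, codes):
    """Write text as a PostScript string using the single-byte codes."""
    out = []
    for ch in text:
        code = codes.get(ch, ord(ch))
        if ch in "()\\":
            out.append("\\" + ch)          # escape PostScript string delimiters
        elif 32 <= code < 127:
            out.append(chr(code))
        else:
            out.append("\\%03o" % code)    # octal escape for everything else
    return "(" + "".join(out) + ")"

text = "20 ℃ (approx)"
codes = assign_codes(text)
program = reencode_program(codes)
show_operand = ps_string(text, codes)
```

With that input, ℃ gets code 128 and the string operand contains \200 in its place. A document with more than 128 distinct non-ASCII characters would need several re-encoded fonts, which is where this approach gets unwieldy.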

Or you could define a CIDFont, and then create a CMap which maps the variable-length byte sequences of UTF-8 into CIDs to reference the correct glyph from the font. IIRC there are already UTF-16 CMaps around. In fact, Adobe makes a number of them available, including UTF-16 and UTF-32 versions for various CJKV languages.
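For a feel of what such a CMap contains, the sketch below generates just the `begincidchar` ... `endcidchar` section that maps UTF-8 byte sequences to CIDs. It is not a complete CMap (no header, no codespace ranges), the CID numbers are invented for illustration and would have to come from the actual CIDFont, and `cidchar_block` is a hypothetical helper. The 100-entry cap reflects my understanding of the per-block limit in CMap files.

```python
def cidchar_block(char_to_cid):
    """Format one cidchar section: <utf-8 bytes in hex> CID, one per line."""
    entries = sorted((ch.encode("utf-8"), cid) for ch, cid in char_to_cid.items())
    if len(entries) > 100:
        raise ValueError("split into several blocks of at most 100 entries")
    lines = ["%d begincidchar" % len(entries)]
    for raw, cid in entries:
        lines.append("<%s> %d" % (raw.hex(), cid))
    lines.append("endcidchar")
    return "\n".join(lines)

# Hypothetical CIDs; real values depend on the font's glyph ordering.
block = cidchar_block({"℃": 1, "é": 2})
```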

Be aware that while these approaches will produce PostScript which renders correctly, and which can then be used to create a PDF file which displays correctly, it will not be possible to copy or search text in the resulting PDF file.

In order to search a PDF file, the font must have an associated ToUnicode CMap. This is a PDF-only construct; it does not exist in PostScript and there is no PostScript equivalent. So there's no way to embed that information in the PostScript program, which means it can't be embedded in the PDF file.
